Open science resources from the Tara Pacific expedition across coral reef and surface ocean ecosystems

The Tara Pacific expedition (2016-2018) sampled coral ecosystems around 32 islands in the Pacific Ocean and the ocean surface waters at 249 locations, resulting in the collection of nearly 58,000 samples. The expedition was designed to systematically study warm coral reefs and included the collection of corals, fish, plankton, and seawater samples for advanced biogeochemical, molecular, and imaging analysis. Here we provide a complete description of the sampling methodology, and we explain how to explore and access the different datasets generated by the expedition. Environmental context data were obtained from taxonomic registries, gazetteers, almanacs, climatologies, operational biogeochemical models, and satellite observations. The quality of the different environmental measures has been validated not only by various quality control steps but also through a global analysis allowing comparison with known large-scale environmental structures. The release of such a broad set of datasets makes it possible to address a wide range of scientific questions.

plankton, and seawater. As with previous Tara expeditions 14, organizing and cross-linking the various measurements is a stepping-stone for open-access science resources following FAIR principles (Findable, Accessible, Interoperable and Reusable 15). In this effort, the strategy adopted by Tara Pacific is to provide open access data and early and full release of the datasets once validated or published. Such an approach ensures long-lasting preservation, discovery and exploration of data by the scientific community, which will certainly lead to new hypotheses and emerging concepts.

Here we present an overview of the sampling strategy used to collect coral holobionts in connection with their local, large-scale or historical environment. We also provide a critical assessment of the environmental context. We provide the full registries describing the geospatial, temporal, and methodological information for every sample, and connect them to the various sampling events or stations. Extensive environmental context is also provided at the level of samples or stations. Such registries and environmental context collections are essential for researchers to explore the Tara Pacific data and will be updated and complemented when additional datasets are released to the public. Throughout the manuscript, terms stated [within brackets] refer to the terms used within the registry or in environmental context datasets.

(Table 1)

The sampling event sequence and protocols were performed consistently over the whole expedition. Sampling followed the same procedure and approximate timing, and was articulated around the same standardized "sampling events" (Figure 2), which allowed the same set of samples to be collected with a standardized protocol (Table 2). On rare occasions, the timing and protocols were adapted for sailing conditions and to fit the schedule. 
Sampling events are characterized by their mode of sampling, which could be either directly from Tara's dinghy [ZODIAC] or directly either using scuba-diving ( ...

... CTD probe (Castaway CTD) was also deployed from the dinghy down to the reef (generally ~5 to 10 m) to record temperature and conductivity profiles.

... released during the mechanical fragmentation of the coral colony. Then, water was pumped using a manual membrane pump onboard Tara's dinghy, which was stationary above the coral colony. A scuba diver held a clean water tubing next to the colony while the operator onboard the dinghy pumped the water up to the skiff. First, the water collected was used to rinse the pumping system, as well as a 20 µm metallic sieve and the 50 L carboys that will be used to transport the ...

... with a metallic pre-filter of 2 mm mesh size, two debubblers, and a flowmeter to record the volume of water sampled. Unfiltered water was collected first for a series of protocols; water was then prefiltered using a 20 µm sieve to rinse and fill two 50 L carboys. Both unfiltered seawater and 20 µm filtered seawater were labelled [CARBOY]. To collect larger plankton, water was pumped from the DOLPHIN into a 20 µm net fixed on the wetlab's wall ([DECKNET-20]) for 1 to 2 hours, depending on biomass concentration, simultaneously with a net tow using a "high speed net" ([HSN-NET-300]). The HSN was equipped with a 300 µm mesh net and designed to be efficient at up to 9 knots. It was towed for 60 to 90 minutes depending on the plankton density. Near islands and in the Great

From Dolphin-Decknet
Once the [DECKNET-20] time limit was reached (between 1 and 2 hours), the flow was stopped and the net was carefully rinsed with 0.2 µm filtered seawater. The plankton sample was then transferred to a 2 L Nalgene bottle and completed to 2 L with 0.2 µm filtered seawater. The sample was homogenized by gently inverting the bottle repeatedly and split into four 250 mL subsamples for [S20]*, one 250 mL sample for [E20]*, one 250 mL sample for [LIVE20]*, and one 45 mL sample for [H20]*. In addition to these already described protocols, one 250 mL sample was also taken for [L20], for which the seawater was drained using a 20 µm sieve and the plankton was transferred to a 50 mL Falcon tube and fixed with 1 mL of acidic Lugol's solution for later microscopic observations. Finally, a 45 mL sample was taken for [F20], transferred to a 50 mL Falcon tube, fixed with 1 mL of 37% formalin solution and completed to 50 mL with sodium tetraborate decahydrate buffer solution for later microscopic observations.

From HSN/Manta nets
Once recovered, samples collected by the HSN net and by the Manta net followed the same procedure. The net was carefully rinsed from the exterior to drain organisms into the collector. Its content was transferred using 0.2 µm filtered seawater into a 2 L Nalgene bottle and completed to 2 L. The sample was then homogenized and split into two 1 L samples. The first half was prefiltered through a 2 mm metallic sieve and filtered onto four 47 mm, 10 µm pore size polycarbonate membranes (250 mL each). Filters were then placed into 5 mL cryotubes, flash frozen and stored in liquid nitrogen for later sequencing ([S300]). The second half was concentrated on a 200 µm sieve and resuspended in a 250 mL double closure bottle using filtered seawater saturated with sodium tetraborate decahydrate, fixed with 30 mL of 37% formalin solution and stored at room temperature for later taxonomic and morphological analysis using imaging methods ([F300]).

Missing value terms are: "nav" = not-available, i.e. the expected information is not given because it has not been collected or generated; "npr" = not-provided, i.e. the expected information has been collected or generated but it is not given, i.e. a value may be available in a later version or may be obtained by contacting the data providers; "nac" = confidential, i.e. the expected information has been collected or generated but is not available openly because of privacy concerns; "nap" = not-applicable, i.e. no information is expected for this combination of parameter, environment and/or method, e.g. 
depth below seabed cannot be given for a sample collected in the water or the atmosphere.

(4) Simplified version at site level
In some cases, certain parameters were not available at specific sampling sites due to technical issues or sensor availability; however, various basin-scale studies and statistical tests require a complete dataset for all sampled sites. During the Tara Pacific expedition, many parameters were concurrently measured in-situ, estimated from remote sensing and/or modelled. For instance, sea surface temperature was measured on the boat using the thermosalinograph included in the underway system, but was also measured by satellite and estimated from a model. Each of these three modes of acquisition has its own caveats and accuracy; however, within a certain confidence interval, missing in-situ data can be replaced by its remotely sensed or modelled equivalent. We provide here a simplified version at the sampling site level by replacing missing in-situ data with their closest and most accurate satellite or modelled equivalent. In each case, in-situ data was considered the most accurate source of data, with a preference for HPLC pigment analysis followed by measurements done by the ACS, while satellite and modelled data were used only if in-situ data was not available. We evaluated the accuracy of the ACS and of each satellite and modelled dataset by linear regression against their in-situ counterparts. A bias of the modelled or satellite data was identified when the slope of the regression differed from 1 and/or the intercept differed from 0. The satellite and modelled data were forced to match the in-situ data by dividing by the slope and subtracting the intercept. This is the case for SST. When a large bias persisted between matchups with observations, the corrected data was not used to replace missing in-situ data. This is the case for chl. 
The same approach was then applied to fill missing data with modelled values (MERCATOR-Copernicus). A bias correction was applied for the following variables: SST, SSS, PO4, and SiOH. As before, if a large bias persisted between observations and corrected data, the corrected data were not used to replace missing in-situ data. This is the case for chl, NO3, and Fe.

The [MTE] samples were sometimes collected in the afternoon instead of in the morning alongside all the other water samples, and were thus located between two sampling stations. These [MTE] samples could not be assigned to a sampling station following the criterion presented in section 3; therefore, the missing values of the corresponding morning stations were interpolated linearly.

The same approach was used for pH measurements, with a preference for measurements provided by total carbonate system quantifications, followed by direct pH measurements and then modelled values (MERCATOR-Copernicus).
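The bias correction and gap filling described in this section can be illustrated with a minimal Python sketch. It is not the expedition's processing code: the function names `fit_line` and `fill_missing` are hypothetical, missing values are represented as NaN, and the regression is assumed to be of the satellite or modelled proxy against the in-situ data (proxy = slope * in_situ + intercept), so that the correction is (proxy - intercept) / slope. The rejection of variables with a persisting large bias is not sketched, because no numerical threshold is given in the text.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_missing(in_situ, proxy):
    """Replace NaN in-situ values with bias-corrected proxy values.

    The fit uses only the sites where both sources are available.
    Assumed regression direction: proxy = slope * in_situ + intercept.
    """
    pairs = [(i, p) for i, p in zip(in_situ, proxy)
             if not (math.isnan(i) or math.isnan(p))]
    slope, intercept = fit_line([i for i, _ in pairs],
                                [p for _, p in pairs])
    filled = []
    for i, p in zip(in_situ, proxy):
        if math.isnan(i):
            # In-situ value missing: use the corrected proxy instead.
            filled.append((p - intercept) / slope)
        else:
            # In-situ data always take priority when available.
            filled.append(i)
    return filled, slope, intercept


# Hypothetical SST values (deg C); the third site has no in-situ measurement.
in_situ_sst = [20.0, 22.0, float("nan"), 26.0, 28.0]
satellite_sst = [22.5, 24.7, 26.9, 29.1, 31.3]
filled_sst, slope, intercept = fill_missing(in_situ_sst, satellite_sst)
```

In this made-up example the satellite values overestimate the in-situ values with a slope of about 1.1 and an intercept of about 0.5, and the gap at the third site is filled with a corrected value close to 24 °C.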

(5) Lagrangian and Eulerian diagnostics
To describe the dynamical properties of the water masses sampled, different Eulerian and Lagrangian diagnostics were calculated. Here, we give a general description of the information each of them provides. In the next subsection, we provide the details of how they were calculated for each station.

The following Eulerian diagnostics were calculated: Absolute velocity ([Uabs], m s^-1): sqrt(u^2 + v^2), where u and v are the zonal and meridional components of the horizontal velocity field ...

... days a water mass has spent inside an eddy in the previous period. If the water mass is outside an eddy, then its retention time is set to zero.
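As a small worked illustration of the absolute velocity [Uabs] defined above, the following Python sketch evaluates sqrt(u^2 + v^2) for single values and for a gridded field. The function names and the example values are hypothetical and are not taken from the expedition's code or data.

```python
import math


def absolute_velocity(u, v):
    """Absolute velocity (m/s) from zonal u and meridional v (m/s)."""
    return math.hypot(u, v)


def absolute_velocity_grid(u_grid, v_grid):
    """Apply absolute_velocity element-wise to two 2-D lists of equal shape."""
    return [
        [absolute_velocity(u, v) for u, v in zip(u_row, v_row)]
        for u_row, v_row in zip(u_grid, v_grid)
    ]


# Hypothetical surface current: 0.3 m/s eastward, 0.4 m/s northward.
speed = absolute_velocity(0.3, 0.4)
```

With these example components the absolute velocity is approximately 0.5 m s^-1.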

(5.1) Extraction of the Eulerian and Lagrangian diagnostics
For each of the 246 stations sampled, we proceeded as follows.

We identified the water mass sampled at the given station. This was considered as a stadium shape with the two semi-circles centered on the starting and ending points of the transect, respectively. The radius of the stadium semi-circles was set to 0.1°, in accordance with previous studies 49,53,54. The stadium was filled with virtual particles separated by 0.01°.

For each virtual particle inside the stadium shape, we calculated a Eulerian or Lagrangian diagnostic (described above). The Eulerian diagnostics were extracted directly from the velocity field of the day of sampling. The Lagrangian diagnostics were obtained by advecting the virtual particle backward in time for an amount of time τ from the day of sampling day_S. For the Lagrangian betweenness, the advection was performed between day_S + τ/2 and day_S - τ/2, so that the advective time window was centered on the sampling day (details in 49).

For the Lagrangian diagnostics, we used the following advective times τ: 5, 10, 15, 20, 30, and 60 days. The only exception is the retention time, which, by construction, was calculated only with the largest advective time, namely τ = 60 days.

Once a given diagnostic (Eulerian or Lagrangian) had been calculated for all the virtual particles filling the stadium shape, we calculated the mean value and the 25th, 50th, and 75th percentiles. The percentiles were calculated in order to quantify the spatial variation of the diagnostic inside the stadium shape. Therefore, we associated each station with four values (mean, 25th, 50th, and 75th percentiles) of a given diagnostic.

Furthermore, two different velocity fields were used, which are described as follows. ...

Each time series was first averaged on a Julian day basis to provide a seasonal average. 
This yearly seasonal average was triplicated and concatenated into a 3-year seasonal cycle so that a digital low-pass filter could be applied to the middle year without generating artifacts. A digital low-pass filter (filter order 3, pass band ripple 0.1; "filtfilt" function in MATLAB) with a 36-Julian-day window was applied to the concatenated time series to remove high-frequency noise. The middle year was then extracted from the concatenated time series to recover the seasonal cycle. The sea surface temperature anomaly was calculated as the SST minus the seasonal cycle over the full time series. Considering the short periods of missing data (mean of the 95th percentile of the duration of consecutive days with missing data: 9.8 ± 4.1 days), the missing values in the SST and SST anomaly time series were linearly interpolated in order to calculate thermal stress indices. The SST anomaly frequency was calculated as the number of days over the past 52 weeks when the SST anomaly was greater than or equal to 1 °C. Thermal stress indices relevant to coral reef health were then calculated using the methodology developed for the Coral Reef Temperature Anomaly Database (CoRTAD) 58 (Table 4). Because cold-temperature accumulation events have also been reported to cause bleaching and mortality 59,60, the same set of indices was calculated for cold stress by adapting the CoRTAD method, but using the minimum weekly climatologies ( ...

Photo analysis for the genus validation and environmental context was conducted using MATLAB with code developed and written specifically for the Tara Pacific Expedition 63. Photos were annotated individually, and annotations were conducted from January to April 2020. To prevent observer bias, photos were randomized, and the annotator was blind to any information regarding the location or the sampling site. 
The analysis included 1) identification to the genus level, 2) algal contact with types of algal genus if identifiable (Halimeda, Turbinaria ...

... 18S sequence by aligning to the NCBI 'nt' database with taxonomic labels. A 'lowest common ancestor' approach was used when there were multiple best hits. These alignment-based annotations were verified phylogenetically (i.e. taxonomic similarity agreed with sequence similarity). More than half of the samples could not be annotated at the genus level or better using this approach, due to the lack of resolution of the 18S V9 marker. Where available, host taxonomic assignments were based on photo annotations. Otherwise, 18S-based annotations were used.
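The seasonal-cycle construction and the SST anomaly frequency described earlier for the sea surface temperature time series can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB code used for the expedition: all function names are hypothetical, a centred moving average is swapped in for the order-3 low-pass filter applied with filtfilt, a 365-day year is assumed with every day of year represented, and "52 weeks" is taken as 364 days.

```python
def julian_day_mean(days_of_year, sst, n_days=365):
    """Average SST by day of year (1..n_days) to get a raw seasonal cycle.

    Assumes every day of year occurs at least once in the series.
    """
    sums = [0.0] * n_days
    counts = [0] * n_days
    for doy, value in zip(days_of_year, sst):
        sums[doy - 1] += value
        counts[doy - 1] += 1
    return [s / c for s, c in zip(sums, counts)]


def smooth_seasonal_cycle(cycle, window=36):
    """Smooth a one-year cycle without edge artifacts.

    The cycle is triplicated, smoothed, and only the middle year is kept.
    A centred moving average stands in for the low-pass filter (assumption).
    """
    n = len(cycle)
    tripled = cycle * 3
    half = window // 2
    smoothed = []
    for k in range(n, 2 * n):  # indices of the middle year
        segment = tripled[k - half:k + half + 1]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


def sst_anomaly(days_of_year, sst, seasonal_cycle):
    """SST minus the seasonal cycle value of the matching day of year."""
    return [value - seasonal_cycle[doy - 1]
            for doy, value in zip(days_of_year, sst)]


def anomaly_frequency(anomaly, threshold=1.0, lookback_days=364):
    """Number of days in the trailing 52 weeks with anomaly >= threshold."""
    recent = anomaly[-lookback_days:]
    return sum(1 for a in recent if a >= threshold)
```

Triplicating the cycle lets the smoothing window at the start of January draw on late-December values, and vice versa, which is the reason the text gives for filtering a 3-year concatenation and keeping only the middle year.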

Technical Validation
Numerous quality control steps were applied at different levels of acquisition to ensure the good quality of the different datasets; they vary depending on the type of measurement and on whether it originates from on-board sensors or from samples.

Inline measurements, models and satellite data validity
[PAR] measurement validity was checked by first removing physically wrong data (i.e. values greater than 0.45 μE/cm2/sec or lower than 0 μE/cm2/sec) and then comparing with clear-sky matchup measurements from MODIS-Aqua & Terra. The comparison confirmed the good agreement between datasets and the absence of sensor drift. Temperature and salinity were acquired by the [TSG]. The quality of the whole time series was manually checked, and the temperature validity was assessed by comparing the temperature readings of the two sensors placed at two different places along the inline system. Potential drift of the temperature sensor was investigated by comparing the temperature time series with satellite sea surface temperature. Salinity measurements were intercalibrated against unfiltered seawater samples [SAL] taken every week from the surface ocean, and corrected for any observed bias. Moreover, temperature and salinity measurements were validated against Argo float data collocated with Tara. The [ACS] absorption and attenuation signal due to dissolved matter, drift, and biofouling were estimated between two filter events by interpolating filtered water absorption and attenuation following the shape of the [fdom] from the [WSCD], when available. This method improves data quality in case of strong variations of dissolved matter absorption that the frequency of filter events would not capture properly (e.g. approaching coastal waters or entering a lagoon). 
When [fdom] data was not available, the filtered absorption and attenuation were linearly interpolated between filter events before being removed from the total absorption and attenuation. From November 13, 2016 to May 6, 2017, the [BB3] was located upstream of the switch system, and thus measured total (non-filtered) water all the time. During this period, the volume scattering coefficient of seawater was removed from the raw data counts to obtain the particulate backscattering coefficient [bbp]. The biofouling and instrument drift were estimated by comparing values before and after each cleaning event. The biofouling was estimated between two cleaning events by fitting an exponential or linear model to the raw data before removing it from the signal. We recommend using data from this period with caution, as they were corrected with theoretical assumptions (i.e. pure seawater scattering and linear or exponential biofouling) that may differ from reality. From May 7, 2017 to the end of the expedition, the [BB3] was located downstream of the filter-switch system so that, like for the [