diff --git a/paper/figs/raw_corr_combined.png b/paper/figs/raw_corr_combined.png new file mode 100644 index 0000000..a04da56 Binary files /dev/null and b/paper/figs/raw_corr_combined.png differ diff --git a/paper/figs/raw_corr_insample.png b/paper/figs/raw_corr_insample.png new file mode 100644 index 0000000..316635f Binary files /dev/null and b/paper/figs/raw_corr_insample.png differ diff --git a/paper/figs/raw_corr_oos.png b/paper/figs/raw_corr_oos.png new file mode 100644 index 0000000..9324311 Binary files /dev/null and b/paper/figs/raw_corr_oos.png differ diff --git a/paper/main.pdf b/paper/main.pdf index 74ea2f7..f6cf139 100644 Binary files a/paper/main.pdf and b/paper/main.pdf differ diff --git a/paper/main.tex b/paper/main.tex index ad9cd71..57c4cc5 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -395,6 +395,209 @@ $(P, \varphi)$ to avoid local minima. \section{Results} \label{sec:results} +\subsection{Raw Pairwise Correlations Between CR, Seismic, and Sunspot Data} +\label{sec:res:rawcorr} + +Before applying any detrending, we characterise the raw statistical relationships +between all three observable time series across the three analysis windows. +This serves as the pre-detrending baseline that motivates the Hodrick--Prescott +filter analysis described in Section~\ref{sec:res:detrend}. + +\subsubsection{CR index: station distribution} + +The global CR index is not collapsed to a single mean. +Instead, for each 5-day bin $t$ we compute the distribution of normalised +count-rate values across all $n_t \geq 3$ contributing stations, +yielding five order statistics: +$\hat{x}_{t}^{(q)} = q\text{-th percentile}$ for $q \in \{5, 25, 50, 75, 95\}$, +together with the inter-station minimum and maximum. +All stations are normalised by their individual long-run mean before the +cross-station percentile is taken, so the station-median series has a grand +mean near unity. +The scatter panels in Figures~\ref{fig:rawinsample}--\ref{fig:rawcombined} +use $\hat{x}_{t}^{(50)}$ as the central CR value (horizontal axis) and display +the $[\hat{x}_{t}^{(5)}, \hat{x}_{t}^{(95)}]$ inter-station spread as +horizontal error bars, with the $[\hat{x}_{t}^{\min}, \hat{x}_{t}^{\max}]$ +range overlaid in a lighter shade. + +\subsubsection{Seismic energy metric} + +The seismic activity per bin is measured by the total released seismic energy, +computed as the sum of the individual earthquake energies: +\begin{equation} + E_t = \sum_{i \in \mathcal{B}_t} 10^{1.5\, M_{W,i}}, + \label{eq:energy} +\end{equation} +where $\mathcal{B}_t$ is the set of $M_W \geq 4.5$ events falling in bin $t$. +Working with $\log_{10} E_t$ removes the extreme skewness of $E_t$ and is +physically preferable to summing $M_W$ directly, which is dimensionally +inconsistent because magnitude is already a logarithmic quantity. +The $\log_{10} E_t$ axis in the scatter panels spans roughly three orders of +magnitude, reflecting the heavy-tailed Gutenberg--Richter distribution of +earthquake sizes. + +\subsubsection{Sunspot number} + +Daily sunspot numbers from SILSO are smoothed with a 365-day rolling mean to +suppress intra-year variability and isolate the solar-cycle envelope. +Each 5-day bin carries both the smoothed value (scatter-plot axis) and the +daily min--max spread within that bin (shown as additional horizontal error +bars on the sunspot axis in Figure~\ref{fig:rawinsample}--\ref{fig:rawcombined}, +panel 3). + +\subsubsection{Correlation results} + +We compute both Pearson $r$ (linear; Fisher $z$-transform 95\% CI) and +Spearman $\rho$ (rank-based; appropriate for the heavy-tailed marginal +distributions of $E_t$ and CR flux). +Bonferroni correction for $3\ \text{pairs} \times 3\ \text{windows} = 9$ +tests is applied; star levels refer to the corrected $p$-values. +Full results are given in Table~\ref{tab:rawcorr}. + +\paragraph{CR vs.\ seismicity.} +The raw Pearson correlation between the station-median CR index and +$\log_{10} E_t$ is $r = 0.057$ in the in-sample window ($N = 3214$ bins, +$p_\text{Bonf} = 0.011$, $\rho = 0.071$, $p_\text{Bonf} < 10^{-3}$) and +$r = 0.046$ in the OOS window ($N = 390$, $p_\text{Bonf} = 1.0$). +The correlation is thus only marginally significant after correction and +disappears entirely in the independent OOS window. +Using the station-95th-percentile CR series in place of the median +amplifies the correlation slightly ($r = 0.069$ in-sample), suggesting that +high-flux excursions drive the signal. + +\paragraph{CR vs.\ sunspot number.} +The dominant raw relationship in the dataset is a strong anti-correlation +between the CR index and the smoothed sunspot number: $r = -0.820$ +($\rho = -0.854$) in-sample, and $r = -0.939$ ($\rho = -0.951$) in the OOS +window. +This is the well-established \emph{Forbush decrease} mechanism: a higher +heliospheric magnetic flux during solar maximum deflects more galactic cosmic +rays before they reach the Earth's atmosphere +\citep{Potgieter2013}, so CR flux and solar activity are naturally anti-phase. +The stronger OOS value ($-0.94$ vs.\ $-0.82$) reflects the wide dynamic range +of Solar Cycle~25 (2020--2025), which reached high activity levels after the +deep minimum of 2019--2020. + +\paragraph{Sunspot vs.\ seismicity.} +The raw correlation between the smoothed sunspot number and seismic energy is +negative but small: $r = -0.095$ ($\rho = -0.099$) in-sample +($p_\text{Bonf} < 10^{-6}$) and indistinguishable from zero in the OOS window +($r = -0.023$, $p_\text{Bonf} = 1.0$). +The negative sign is consistent with reports of a weak inverse relationship +between solar activity and global seismicity +\citep{Odintsov2006}, though the OOS failure indicates this effect is not +robust. + +\subsubsection{Interpretation: a confounding triangle} + +The three raw correlations form a confounding triangle. +The CR--sunspot anti-correlation is strong ($r \approx -0.82$ to $-0.94$) +and physically understood. +The sunspot--seismicity anti-correlation is weak but significant in-sample +($r \approx -0.095$). +Together these imply a spurious \emph{positive} CR--seismicity correlation: +both CR and seismicity are jointly modulated by the $\sim$11-year solar cycle +--- CR increases during solar minima, and seismicity may very weakly increase +during solar minima --- so the two series covary positively without any direct +physical connection. +This confound cannot be resolved by naive cross-correlation; it requires +explicitly removing the shared solar-cycle component. +The HP filter analysis (Section~\ref{sec:res:detrend}) accomplishes exactly +this, and shows that the apparent CR--seismicity signal vanishes once the +solar-cycle trend is eliminated. + +These raw correlations are intentionally uncontrolled for solar-cycle +modulation. +They are presented here as the pre-detrending baseline to document the scale +of the confound before any correction is applied, and to highlight that even +the uncorrected CR--seismicity correlation ($r = 0.057$) is far weaker than +the CR--sunspot anti-correlation ($r = -0.82$) that drives it. + +\begin{table}[htbp] + \centering + \caption{Raw pairwise correlation statistics across three time windows. + Bonferroni correction applied for $3 \times 3 = 9$ tests. + CR uses the per-bin station-median index ($\hat{x}^{(50)}$). + Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$. + Sunspot is the 365-day smoothed daily count. + $^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$, + $^{***}p_\text{Bonf}<0.001$.} + \label{tab:rawcorr} + \setlength{\tabcolsep}{4pt} + \begin{tabular}{llrrrrrrr} + \toprule + Pair & Window & $N$ & + $r$ & 95\% CI & + $p$ (raw) & $p$ (Bonf.) & + $\rho$ & $p_\rho$ (Bonf.) \\ + \midrule + CR vs Seismicity & In-sample & 3214 & $+0.057^{*}$ & $[+0.023, +0.092]$ & $0.0012$ & $0.011$ & $+0.071^{***}$ & $<10^{-3}$ \\ + & OOS & 390 & $+0.046$ & $[-0.053, +0.145]$ & $0.362$ & $1.00$ & $+0.018$ & $1.00$ \\ + & Combined & 3604 & $+0.055^{**}$ & $[+0.023, +0.088]$ & $0.0009$ & $0.008$ & $+0.065^{***}$ & $<10^{-3}$ \\ + \addlinespace + CR vs Sunspot & In-sample & 3109 & $-0.820^{***}$ & $[-0.831, -0.808]$ & $\approx 0$ & $\approx 0$ & $-0.854^{***}$ & $\approx 0$ \\ + & OOS & 385 & $-0.939^{***}$ & $[-0.950, -0.926]$ & $\approx 0$ & $\approx 0$ & $-0.951^{***}$ & $\approx 0$ \\ + & Combined & 3494 & $-0.815^{***}$ & $[-0.826, -0.804]$ & $\approx 0$ & $\approx 0$ & $-0.844^{***}$ & $\approx 0$ \\ + \addlinespace + Sunspot vs Seismicity & In-sample & 3109 & $-0.095^{***}$ & $[-0.130, -0.060]$ & $1.1 \times 10^{-7}$ & $10^{-6}$ & $-0.099^{***}$ & $<10^{-6}$ \\ + & OOS & 385 & $-0.023$ & $[-0.123, +0.077]$ & $0.648$ & $1.00$ & $-0.016$ & $1.00$ \\ + & Combined & 3494 & $-0.086^{***}$ & $[-0.119, -0.053]$ & $3.4 \times 10^{-7}$ & $3 \times 10^{-6}$ & $-0.086^{***}$ & $<10^{-6}$ \\ + \addlinespace + \bottomrule + \end{tabular} + + \medskip + \raggedright + \textit{Note:} CR$_{p95}$ variant (station 95th-percentile instead of median) + gives qualitatively identical structure; full results in + \texttt{results/raw\_pairwise\_correlations.json}. +\end{table} + +\begin{figure}[htbp] + \centering + \includegraphics[width=\textwidth]{raw_corr_insample.png} + \caption{Raw pairwise scatter plots for the in-sample window (1976--2019, + $N = 3{,}215$ five-day bins). + \textbf{Left}: CR station-median index vs $\log_{10}$ seismic energy; horizontal + error bars span the station $[\hat{x}^{(5)}, \hat{x}^{(95)}]$ spread. + \textbf{Centre}: CR index vs 365-day smoothed sunspot number; the strong + anti-correlation ($r = -0.82$) reflects the Forbush decrease mechanism. + \textbf{Right}: Smoothed sunspot number vs $\log_{10}$ seismic energy; thin + horizontal error bars show the daily sunspot spread within each 5-day bin. + All points are coloured by decimal year (plasma colormap), revealing the + solar-cycle drift: successive cycles trace the same anti-correlated arc in + the centre panel. + Black curves are LOWESS trend lines ($f = 0.4$).} + \label{fig:rawinsample} +\end{figure} + +\begin{figure}[htbp] + \centering + \includegraphics[width=\textwidth]{raw_corr_oos.png} + \caption{Raw pairwise scatter plots for the out-of-sample window + (2020--2025, $N = 390$ bins, 27 NMDB stations). + Layout identical to Figure~\ref{fig:rawinsample}. + The CR--sunspot anti-correlation strengthens to $r = -0.939$ during Solar + Cycle~25, which had a particularly wide dynamic range. + The CR--seismicity correlation ($r = 0.046$, $p_\text{Bonf} = 1.0$) is + indistinguishable from zero.} + \label{fig:rawoos} +\end{figure} + +\begin{figure}[htbp] + \centering + \includegraphics[width=\textwidth]{raw_corr_combined.png} + \caption{Raw pairwise scatter plots for the full combined window + (1976--2025, $N = 3{,}604$ bins). + The multi-decadal colour gradient in the centre panel shows CR and sunspot + oscillating in anti-phase across four complete solar cycles. + The overall CR--seismicity Pearson correlation ($r = 0.055$, $\rho = 0.065$) + is statistically significant ($p_\text{Bonf} = 0.008$) but quantitatively + negligible, and its sign follows directly from the confounding triangle + described in the text.} + \label{fig:rawcombined} +\end{figure} + \subsection{In-Sample Replication (1976--2019)} \label{sec:res:insample} @@ -661,7 +864,7 @@ at $\tau = -525$~days. Even setting aside the statistical issues, the proposed mechanism faces severe physical constraints. The total ionisation dose from galactic CRs at the surface is -$\sim$\SI{0.3}{\milli\gray\per\year} \citep{Aplin2005} --- far too small to +$\sim 0.3$~mGy/year \citep{Aplin2005} --- far too small to transfer meaningful mechanical energy to fault zones, which require shear-stress changes of order $\sim$0.01--1~MPa to trigger earthquakes. Proposed mechanisms via radon ionisation \citep{Pulinets2004} or nuclear diff --git a/results/figs/raw_corr_combined.png b/results/figs/raw_corr_combined.png new file mode 100644 index 0000000..a04da56 Binary files /dev/null and b/results/figs/raw_corr_combined.png differ diff --git a/results/figs/raw_corr_insample.png b/results/figs/raw_corr_insample.png new file mode 100644 index 0000000..316635f Binary files /dev/null and b/results/figs/raw_corr_insample.png differ diff --git a/results/figs/raw_corr_oos.png b/results/figs/raw_corr_oos.png new file mode 100644 index 0000000..9324311 Binary files /dev/null and b/results/figs/raw_corr_oos.png differ diff --git a/results/raw_pairwise_correlations.json b/results/raw_pairwise_correlations.json new file mode 100644 index 0000000..be8369c --- /dev/null +++ b/results/raw_pairwise_correlations.json @@ -0,0 +1,200 @@ +{ + "n_tests_bonferroni": 9, + "results": [ + { + "label": "CR_p50 vs Seismic", + "window": "In-sample (1976\u20132019)", + "n_bins": 3214, + "pearson_r": 0.05710710378868533, + "pearson_p": 0.00119988073256791, + "pearson_ci_lo": 0.02257726973505399, + "pearson_ci_hi": 0.09150085205340468, + "pearson_p_bonf": 0.010798926593111189, + "spearman_rho": 0.07097861545637706, + "spearman_p": 5.63861918381612e-05, + "spearman_p_bonf": 0.0005074757265434508 + }, + { + "label": "CR_p95 vs Seismic", + "window": "In-sample (1976\u20132019)", + "n_bins": 3214, + "pearson_r": 0.06907963019575922, + "pearson_p": 8.875791465767054e-05, + "pearson_ci_lo": 0.03458782771353293, + "pearson_ci_hi": 0.1034070656655567, + "pearson_p_bonf": 0.0007988212319190349, + "spearman_rho": 0.0772244192684692, + "spearman_p": 1.1713062375821977e-05, + "spearman_p_bonf": 0.0001054175613823978 + }, + { + "label": "CR_p50 vs Sunspot", + "window": "In-sample (1976\u20132019)", + "n_bins": 3109, + "pearson_r": -0.8196118840402077, + "pearson_p": 0.0, + "pearson_ci_lo": -0.8308273920487026, + "pearson_ci_hi": -0.8077309154215488, + "pearson_p_bonf": 0.0, + "spearman_rho": -0.8537788002205909, + "spearman_p": 0.0, + "spearman_p_bonf": 0.0 + }, + { + "label": "CR_p95 vs Sunspot", + "window": "In-sample (1976\u20132019)", + "n_bins": 3109, + "pearson_r": -0.7111385470496709, + "pearson_p": 0.0, + "pearson_ci_lo": -0.7280904984781218, + "pearson_ci_hi": -0.6933173005311596, + "pearson_p_bonf": 0.0, + "spearman_rho": -0.7120845180015138, + "spearman_p": 0.0, + "spearman_p_bonf": 0.0 + }, + { + "label": "Sunspot vs Seismic", + "window": "In-sample (1976\u20132019)", + "n_bins": 3109, + "pearson_r": -0.09493087002363143, + "pearson_p": 1.1388681609654831e-07, + "pearson_ci_lo": -0.12965168680195338, + "pearson_ci_hi": -0.059977540109432424, + "pearson_p_bonf": 1.0249813448689347e-06, + "spearman_rho": -0.09896108460051863, + "spearman_p": 3.215656878418318e-08, + "spearman_p_bonf": 2.894091190576486e-07 + }, + { + "label": "CR_p50 vs Seismic", + "window": "OOS (2020\u20132025)", + "n_bins": 390, + "pearson_r": 0.046285662686090766, + "pearson_p": 0.3619670906211286, + "pearson_ci_lo": -0.053261360369158024, + "pearson_ci_hi": 0.14492178269732792, + "pearson_p_bonf": 1.0, + "spearman_rho": 0.018431820974901467, + "spearman_p": 0.7167081698328611, + "spearman_p_bonf": 1.0 + }, + { + "label": "CR_p95 vs Seismic", + "window": "OOS (2020\u20132025)", + "n_bins": 390, + "pearson_r": 0.01741112912354181, + "pearson_p": 0.731775599256422, + "pearson_ci_lo": -0.08203292234056903, + "pearson_ci_hi": 0.1165119037773267, + "pearson_p_bonf": 1.0, + "spearman_rho": 0.005527877339000357, + "spearman_p": 0.913347440746341, + "spearman_p_bonf": 1.0 + }, + { + "label": "CR_p50 vs Sunspot", + "window": "OOS (2020\u20132025)", + "n_bins": 385, + "pearson_r": -0.9386922203247808, + "pearson_p": 3.195693957272318e-179, + "pearson_ci_lo": -0.9495525580837112, + "pearson_ci_hi": -0.925583111407333, + "pearson_p_bonf": 2.876124561545086e-178, + "spearman_rho": -0.9509485215521011, + "spearman_p": 2.976672353983517e-197, + "spearman_p_bonf": 2.679005118585165e-196 + }, + { + "label": "CR_p95 vs Sunspot", + "window": "OOS (2020\u20132025)", + "n_bins": 385, + "pearson_r": -0.8009259368888886, + "pearson_p": 2.4808686858547665e-87, + "pearson_ci_lo": -0.8341024755606444, + "pearson_ci_hi": -0.7619757168998428, + "pearson_p_bonf": 2.2327818172692897e-86, + "spearman_rho": -0.8577912916660049, + "spearman_p": 9.430786501080505e-113, + "spearman_p_bonf": 8.487707850972454e-112 + }, + { + "label": "Sunspot vs Seismic", + "window": "OOS (2020\u20132025)", + "n_bins": 385, + "pearson_r": -0.023356101158924312, + "pearson_p": 0.6477774446611007, + "pearson_ci_lo": -0.12301462492345265, + "pearson_ci_hi": 0.07676878530250154, + "pearson_p_bonf": 1.0, + "spearman_rho": -0.016183406260000095, + "spearman_p": 0.7516005849333192, + "spearman_p_bonf": 1.0 + }, + { + "label": "CR_p50 vs Seismic", + "window": "Combined (1976\u20132025)", + "n_bins": 3604, + "pearson_r": 0.055280515666118794, + "pearson_p": 0.0008999788202432366, + "pearson_ci_lo": 0.022671514404173917, + "pearson_ci_hi": 0.08777201673979473, + "pearson_p_bonf": 0.008099809382189129, + "spearman_rho": 0.06450971045346121, + "spearman_p": 0.00010643839785029366, + "spearman_p_bonf": 0.0009579455806526429 + }, + { + "label": "CR_p95 vs Seismic", + "window": "Combined (1976\u20132025)", + "n_bins": 3604, + "pearson_r": 0.06534669803613433, + "pearson_p": 8.643658108903141e-05, + "pearson_ci_lo": 0.03276668669712753, + "pearson_ci_hi": 0.09778798224460038, + "pearson_p_bonf": 0.0007779292298012827, + "spearman_rho": 0.07455587634565956, + "spearman_p": 7.446103364283088e-06, + "spearman_p_bonf": 6.701493027854779e-05 + }, + { + "label": "CR_p50 vs Sunspot", + "window": "Combined (1976\u20132025)", + "n_bins": 3494, + "pearson_r": -0.8151123489946459, + "pearson_p": 0.0, + "pearson_ci_lo": -0.8259476899624344, + "pearson_ci_hi": -0.8036749974749445, + "pearson_p_bonf": 0.0, + "spearman_rho": -0.8441554235717627, + "spearman_p": 0.0, + "spearman_p_bonf": 0.0 + }, + { + "label": "CR_p95 vs Sunspot", + "window": "Combined (1976\u20132025)", + "n_bins": 3494, + "pearson_r": -0.7288431646136924, + "pearson_p": 0.0, + "pearson_ci_lo": -0.7440213093422423, + "pearson_ci_hi": -0.7129131861565221, + "pearson_p_bonf": 0.0, + "spearman_rho": -0.7424462187819917, + "spearman_p": 0.0, + "spearman_p_bonf": 0.0 + }, + { + "label": "Sunspot vs Seismic", + "window": "Combined (1976\u20132025)", + "n_bins": 3494, + "pearson_r": -0.08618292707351498, + "pearson_p": 3.3617107739195996e-07, + "pearson_ci_lo": -0.11900279156634344, + "pearson_ci_hi": -0.05317493830432684, + "pearson_p_bonf": 3.0255396965276398e-06, + "spearman_rho": -0.08626578684354677, + "spearman_p": 3.275322308474069e-07, + "spearman_p_bonf": 2.9477900776266624e-06 + } + ] +} \ No newline at end of file diff --git a/results/raw_pairwise_table.tex b/results/raw_pairwise_table.tex new file mode 100644 index 0000000..89c2f4b --- /dev/null +++ b/results/raw_pairwise_table.tex @@ -0,0 +1,38 @@ +% Auto-generated by 09_raw_pairwise_correlations.py +\begin{table}[htbp] + \centering + \caption{Raw pairwise correlation statistics across three time windows. + Bonferroni correction applied for $3 \times 3 = 9$ tests. + CR uses the per-bin station-median index. + Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$. + Sunspot is the 365-day smoothed daily count. + $^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$, + $^{***}p_\text{Bonf}<0.001$.} + \label{tab:rawcorr} + \setlength{\tabcolsep}{4pt} + \begin{tabular}{llrrrrrrr} + \toprule + Pair & Window & $N$ & + $r$ & 95\% CI & + $p$ (raw) & $p$ (Bonf.) & + $\rho$ & $p_\rho$ (Bonf.) \\ + \midrule + CR (med.) vs Seismicity & In-sample (1976--2019) & 3214 & 0.057$^{*}$ & [0.023, 0.092] & 0.0012 & 0.0108 & 0.071$^{***}$ & 0.000507 \\ + & OOS (2020--2025) & 390 & 0.046 & [-0.053, 0.145] & 0.362 & 1 & 0.018 & 1 \\ + & Combined (1976--2025) & 3604 & 0.055$^{**}$ & [0.023, 0.088] & 0.0009 & 0.0081 & 0.065$^{***}$ & 0.000958 \\ + \addlinespace + CR (med.) vs Sunspot & In-sample (1976--2019) & 3109 & -0.820$^{***}$ & [-0.831, -0.808] & 0 & 0 & -0.854$^{***}$ & 0 \\ + & OOS (2020--2025) & 385 & -0.939$^{***}$ & [-0.950, -0.926] & 3.2e-179 & 2.88e-178 & -0.951$^{***}$ & 2.68e-196 \\ + & Combined (1976--2025) & 3494 & -0.815$^{***}$ & [-0.826, -0.804] & 0 & 0 & -0.844$^{***}$ & 0 \\ + \addlinespace + Sunspot vs Seismicity & In-sample (1976--2019) & 3109 & -0.095$^{***}$ & [-0.130, -0.060] & 1.14e-07 & 1.02e-06 & -0.099$^{***}$ & 2.89e-07 \\ + & OOS (2020--2025) & 385 & -0.023 & [-0.123, 0.077] & 0.648 & 1 & -0.016 & 1 \\ + & Combined (1976--2025) & 3494 & -0.086$^{***}$ & [-0.119, -0.053] & 3.36e-07 & 3.03e-06 & -0.086$^{***}$ & 2.95e-06 \\ + \addlinespace + \bottomrule + \end{tabular} + \bigskip + + \textit{Note: CR\textsubscript{p95} variant (station 95th percentile instead of median) + gives similar structure; see \texttt{results/raw\_pairwise\_correlations.json}.} +\end{table} diff --git a/scripts/09_raw_pairwise_correlations.py b/scripts/09_raw_pairwise_correlations.py new file mode 100644 index 0000000..0cadb69 --- /dev/null +++ b/scripts/09_raw_pairwise_correlations.py @@ -0,0 +1,860 @@ +#!/usr/bin/env python3 +""" +09_raw_pairwise_correlations.py +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Raw pairwise correlations between galactic cosmic-ray flux (CR), global +seismicity, and solar activity (sunspot number) across three time windows: + + in-sample : 1976-01-01 to 2019-12-31 + OOS : 2020-01-01 to 2025-04-29 + combined : 1976-01-01 to 2025-04-29 + +No HP filtering or detrending is applied. Missing OOS data (NMDB 2020-2025, +USGS 2020-2025) are downloaded automatically. + +CR is represented by its per-bin distribution across NMDB stations (p5, p50, +p95, min, max). Seismic energy uses the physically correct E ∝ 10^(1.5·Mw) +sum. Two CR variants are correlated: the station-median (p50) and station-p95. + +Outputs +------- +results/raw_pairwise_correlations.json +results/figs/raw_corr_insample.png +results/figs/raw_corr_oos.png +results/figs/raw_corr_combined.png +""" + +from __future__ import annotations + +import json +import logging +import sys +import warnings +from pathlib import Path + +import numpy as np +import pandas as pd +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import matplotlib.colors as mcolors +from scipy import stats +from statsmodels.nonparametric.smoothers_lowess import lowess + +# --------------------------------------------------------------------------- +# Paths +# --------------------------------------------------------------------------- +ROOT = Path(__file__).resolve().parent.parent +NMDB_DIR = ROOT / "data" / "raw" / "nmdb" +USGS_DIR = ROOT / "data" / "raw" / "usgs" +SIDC_DIR = ROOT / "data" / "raw" / "sidc" +AVAIL_FILE = ROOT / "results" / "data_availability.json" +OUT_DIR = ROOT / "results" +FIG_DIR = ROOT / "results" / "figs" + +# Add src/ to path +sys.path.insert(0, str(ROOT / "src")) +from crq.ingest.nmdb import download_station_year, load_station, resample_daily +from crq.ingest.usgs import download_year as download_usgs_year, load_usgs + +# --------------------------------------------------------------------------- +# Constants +# --------------------------------------------------------------------------- +BIN_DAYS = 5 +EPOCH = pd.Timestamp("1976-01-01") +IN_SAMPLE_START = "1976-01-01" +IN_SAMPLE_END = "2019-12-31" +OOS_START = "2020-01-01" +OOS_END = "2025-04-29" +COMBINED_START = "1976-01-01" +COMBINED_END = "2025-04-29" +MIN_MAG = 4.5 +MIN_STATIONS = 3 +COVERAGE_THRESHOLD = 0.60 + +logging.basicConfig( + level=logging.INFO, + format="%(asctime)s %(levelname)s %(message)s", + datefmt="%H:%M:%S", +) +log = logging.getLogger(__name__) + +# --------------------------------------------------------------------------- +# Helper: 5-day bin index +# --------------------------------------------------------------------------- + +def _bin_index(ts: pd.Timestamp) -> int: + return (ts - EPOCH).days // BIN_DAYS + + +def _bin_start(b: int) -> pd.Timestamp: + return EPOCH + pd.Timedelta(days=b * BIN_DAYS) + + +def _decimal_year(ts: pd.DatetimeIndex) -> np.ndarray: + """Convert DatetimeIndex to decimal year (for scatter colouring).""" + yr = ts.year.values.astype(float) + day_of_yr = ts.day_of_year.values.astype(float) + days_in_yr = np.where( + pd.DatetimeIndex(ts).is_leap_year, 366.0, 365.0 + ) + return yr + (day_of_yr - 1.0) / days_in_yr + + +# --------------------------------------------------------------------------- +# Download helpers +# --------------------------------------------------------------------------- + +def _ensure_usgs(years: range) -> None: + """Download any missing USGS yearly files.""" + for yr in years: + dest = USGS_DIR / f"usgs-{yr}.csv" + if dest.exists() and dest.stat().st_size > 100: + continue + log.info("Downloading USGS %d …", yr) + try: + download_usgs_year(yr, USGS_DIR, min_magnitude=MIN_MAG) + except Exception as exc: + log.warning("USGS %d download failed: %s", yr, exc) + + +def _oos_stations() -> list[str]: + """Return OOS-good station list from data_availability.json.""" + if AVAIL_FILE.exists(): + with open(AVAIL_FILE) as fh: + data = json.load(fh) + return data.get("good_stations_oos", []) + # Fall back to all stations present in config + import yaml + cfg = yaml.safe_load((ROOT / "config" / "stations.yaml").read_text()) + return list(cfg["stations"].keys()) + + +def _ensure_nmdb_oos(stations: list[str], years: range) -> None: + """Download any missing NMDB station-year files for OOS window.""" + total = len(stations) * len(years) + done = 0 + for stn in stations: + for yr in years: + dest = NMDB_DIR / f"{stn}{yr}.csv" + if dest.exists() and dest.stat().st_size > 0: + done += 1 + continue + log.info("[%d/%d] Downloading NMDB %s %d …", done + 1, total, stn, yr) + try: + download_station_year(stn, yr, NMDB_DIR, sleep_s=0.5) + except Exception as exc: + log.warning("NMDB %s %d download failed: %s", stn, yr, exc) + done += 1 + + +# --------------------------------------------------------------------------- +# Load NMDB: per-bin station distribution +# --------------------------------------------------------------------------- + +def _load_nmdb_bins( + start: str, + end: str, + coverage_thr: float = COVERAGE_THRESHOLD, +) -> pd.DataFrame: + """ + Load all NMDB stations for [start, end] and return a DataFrame of + per-5d-bin statistics across stations: + cr_p05, cr_p25, cr_p50, cr_p75, cr_p95, cr_min, cr_max, cr_n + Each station is normalised by its long-run mean before aggregation. + """ + t0 = pd.Timestamp(start) + t1 = pd.Timestamp(end) + start_yr = t0.year + end_yr = t1.year + + # Determine which stations have files in this window + station_files: dict[str, list[Path]] = {} + for p in sorted(NMDB_DIR.glob("*.csv")): + stem = p.stem # e.g. AATA2018 + stn = "".join(c for c in stem if not c.isdigit()) + yr_str = "".join(c for c in stem if c.isdigit()) + if not yr_str: + continue + yr = int(yr_str) + if yr < start_yr or yr > end_yr: + continue + station_files.setdefault(stn, []).append(p) + + if not station_files: + log.warning("No NMDB files found for %s–%s", start, end) + return pd.DataFrame() + + # Build bin grid + b0 = _bin_index(t0) + b1 = _bin_index(t1) + bin_idx = np.arange(b0, b1 + 1) + bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx]) + + station_means: dict[str, pd.Series] = {} # station -> per-bin mean (normalised) + + for stn, _ in station_files.items(): + try: + hourly = load_station(stn, start_yr, end_yr, NMDB_DIR) + except Exception as exc: + log.warning("load_station %s failed: %s", stn, exc) + continue + if hourly.empty or stn not in hourly.columns: + continue + + daily = resample_daily(hourly, stn, coverage_threshold=coverage_thr) + daily_vals = daily[stn].loc[start:end] + if daily_vals.isna().all(): + continue + + # Normalise by station long-run mean (ignore NaN) + station_mean = daily_vals.mean(skipna=True) + if np.isnan(station_mean) or station_mean == 0: + continue + daily_norm = daily_vals / station_mean + + # Resample to 5-day bins (mean within bin) + # Use bin index as grouper + day_index = daily_norm.index + bin_of_day = ((day_index - EPOCH).days // BIN_DAYS) + grp = daily_norm.groupby(bin_of_day) + bin_mean = grp.mean() # index = bin integer + bin_mean.index = [_bin_start(b) for b in bin_mean.index] + bin_mean = bin_mean.reindex(bin_starts) + station_means[stn] = bin_mean + + if not station_means: + log.warning("No valid stations for %s–%s", start, end) + return pd.DataFrame() + + # Stack into (bins × stations) matrix + mat = pd.DataFrame(station_means).reindex(bin_starts) + + # Per-bin statistics across stations (ignore NaN) + n_valid = mat.notna().sum(axis=1) + mask = n_valid >= MIN_STATIONS # require at least MIN_STATIONS stations + + result = pd.DataFrame(index=bin_starts) + result["cr_p05"] = mat.quantile(0.05, axis=1) + result["cr_p25"] = mat.quantile(0.25, axis=1) + result["cr_p50"] = mat.quantile(0.50, axis=1) + result["cr_p75"] = mat.quantile(0.75, axis=1) + result["cr_p95"] = mat.quantile(0.95, axis=1) + result["cr_min"] = mat.min(axis=1) + result["cr_max"] = mat.max(axis=1) + result["cr_n"] = n_valid.values + + # Mask bins with fewer than MIN_STATIONS stations + for col in result.columns: + if col != "cr_n": + result.loc[~mask, col] = np.nan + + log.info("NMDB %s–%s: %d bins, %d stations, %.1f%% valid bins", + start, end, len(result), len(station_means), + 100.0 * mask.sum() / len(result)) + return result + + +# --------------------------------------------------------------------------- +# Load USGS: per-bin seismic energy E ∝ 10^(1.5·Mw) +# --------------------------------------------------------------------------- + +def _load_seismic_energy(start: str, end: str) -> pd.Series: + """ + Load USGS events for [start, end] and compute per-5d-bin summed seismic + energy E = Σ 10^(1.5 · Mw). Returns log10(E); bins with no events → NaN. + """ + t0 = pd.Timestamp(start) + t1 = pd.Timestamp(end) + events = load_usgs(t0.year, t1.year, USGS_DIR) + + if events.empty or "mag" not in events.columns: + log.warning("No USGS events loaded for %s–%s", start, end) + return pd.Series(dtype=float) + + events = events.loc[start:end] + events = events[events["mag"] >= MIN_MAG].copy() + events["energy"] = np.power(10.0, 1.5 * events["mag"].values) + + # 5-day binning + b0 = _bin_index(t0) + b1 = _bin_index(t1) + bin_idx = np.arange(b0, b1 + 1) + bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx]) + + day_of_event = events.index.normalize() + bin_of_event = ((day_of_event - EPOCH).days // BIN_DAYS) + events["bin"] = bin_of_event.values + + bin_energy = events.groupby("bin")["energy"].sum() + + result = bin_energy.reindex(bin_idx) + result.index = bin_starts + log.info("Seismic %s–%s: %d events, %.1f%% bins non-zero", + start, end, len(events), + 100.0 * result.notna().sum() / len(result)) + return np.log10(result) # log10(E) for all operations + + +# --------------------------------------------------------------------------- +# Load SIDC sunspots: bin to 5-day, carry raw spread +# --------------------------------------------------------------------------- + +def _load_sunspot_bins(start: str, end: str) -> pd.DataFrame: + """ + Load KSO/SIDC daily sunspot CSV and return a per-5d-bin DataFrame with: + sn_raw_mean, sn_raw_min, sn_raw_max, sn_smooth (365-day rolling mean) + The smoothed series is computed on the daily series before binning. + """ + path = SIDC_DIR / "sunspots.csv" + if not path.exists(): + raise FileNotFoundError(f"Sunspot file not found: {path}") + + # KSO format: Date,Total,North,South,Diff (comma-separated) + # Standard SIDC format: Year;Month;Day;FracYear;SN;StdDev;Nobs;Definitive + # Detect format by reading header + header = path.read_text(encoding="utf-8", errors="replace").splitlines()[0] + if ";" in header and "Year" in header: + # SIDC SILSO format + df = pd.read_csv( + path, sep=";", + names=["Year", "Month", "Day", "FracYear", "SN", "StdDev", "Nobs", "Def"], + header=0, + ) + df["date"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"], day=df["Day"])) + df = df.set_index("date")[["SN"]].rename(columns={"SN": "sn"}) + df["sn"] = pd.to_numeric(df["sn"], errors="coerce") + else: + # KSO comma-separated format + df = pd.read_csv( + path, sep=",", + names=["date", "total", "north", "south", "diff"], + header=0, + ) + df["date"] = pd.to_datetime(df["date"].str.strip(), errors="coerce") + df = df.dropna(subset=["date"]) + df = df.set_index("date") + for col in df.columns: + df[col] = pd.to_numeric(df[col], errors="coerce") + df = df.rename(columns={"total": "sn"})[["sn"]] + + df = df.sort_index() + df = df.loc[~df.index.duplicated(keep="first")] + + # 365-day rolling mean (smoothed solar cycle) + df["sn_smooth"] = df["sn"].rolling(window=365, center=True, min_periods=180).mean() + + # Ensure DatetimeIndex + df.index = pd.to_datetime(df.index, errors="coerce") + df = df[df.index.notna()] + df = df.sort_index() + + # Clip to window + df = df.loc[start:end] + + t0 = pd.Timestamp(start) + t1 = pd.Timestamp(end) + b0 = _bin_index(t0) + b1 = _bin_index(t1) + bin_idx = np.arange(b0, b1 + 1) + bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx]) + + day_idx = pd.DatetimeIndex(df.index).normalize() + bin_of_day = ((day_idx - EPOCH).days // BIN_DAYS) + df = df.copy() + df["bin"] = bin_of_day.values + + grp = df.groupby("bin") + sn_mean = grp["sn"].mean().reindex(bin_idx) + sn_min = grp["sn"].min().reindex(bin_idx) + sn_max = grp["sn"].max().reindex(bin_idx) + sn_smooth = grp["sn_smooth"].mean().reindex(bin_idx) + result = pd.DataFrame({ + "sn_mean": sn_mean.values, + "sn_min": sn_min.values, + "sn_max": sn_max.values, + "sn_smooth": sn_smooth.values, + }, index=bin_starts) + + log.info("Sunspot %s–%s: %d bins, %d daily records", + start, end, len(result), len(df)) + return result + + +# --------------------------------------------------------------------------- +# Correlation statistics +# --------------------------------------------------------------------------- + +def _pearson_with_ci(x: np.ndarray, y: np.ndarray, alpha: float = 0.05): + """ + Pearson r, p-value, and (1-alpha) CI via Fisher z-transform. + Returns (r, p, ci_lo, ci_hi, n). + """ + mask = np.isfinite(x) & np.isfinite(y) + x, y = x[mask], y[mask] + n = len(x) + if n < 4: + return np.nan, np.nan, np.nan, np.nan, n + r, p = stats.pearsonr(x, y) + z = np.arctanh(r) + se = 1.0 / np.sqrt(n - 3) + z_crit = stats.norm.ppf(1.0 - alpha / 2) + ci_lo = float(np.tanh(z - z_crit * se)) + ci_hi = float(np.tanh(z + z_crit * se)) + return float(r), float(p), ci_lo, ci_hi, n + + +def _spearman(x: np.ndarray, y: np.ndarray): + """Spearman ρ and p-value. Returns (rho, p, n).""" + mask = np.isfinite(x) & np.isfinite(y) + x, y = x[mask], y[mask] + n = len(x) + if n < 4: + return np.nan, np.nan, n + rho, p = stats.spearmanr(x, y) + return float(rho), float(p), n + + +def _correlate_pair( + x: np.ndarray, y: np.ndarray, label: str, window: str, n_tests: int +) -> dict: + """Compute Pearson + Spearman for one (x, y) pair with Bonferroni correction.""" + pr, pp, ci_lo, ci_hi, n = _pearson_with_ci(x, y) + rho, sp, _ = _spearman(x, y) + return { + "label": label, + "window": window, + "n_bins": n, + "pearson_r": pr, + "pearson_p": pp, + "pearson_ci_lo": ci_lo, + "pearson_ci_hi": ci_hi, + "pearson_p_bonf": min(1.0, pp * n_tests) if np.isfinite(pp) else np.nan, + "spearman_rho": rho, + "spearman_p": sp, + "spearman_p_bonf": min(1.0, sp * n_tests) if np.isfinite(sp) else np.nan, + } + + +# --------------------------------------------------------------------------- +# Plotting +# --------------------------------------------------------------------------- + +def _lowess_line(x: np.ndarray, y: np.ndarray, frac: float = 0.4): + """Return (x_sorted, y_smooth) for a LOWESS trend line.""" + mask = np.isfinite(x) & np.isfinite(y) + if mask.sum() < 10: + return np.array([]), np.array([]) + xl, yl = x[mask], y[mask] + order = np.argsort(xl) + xl, yl = xl[order], yl[order] + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + sm = lowess(yl, xl, frac=frac, return_sorted=True) + return sm[:, 0], sm[:, 1] + + +def _scatter_panel( + ax: plt.Axes, + x_med: np.ndarray, + x_lo: np.ndarray, + x_hi: np.ndarray, + x_xlo: np.ndarray, # extreme low (min) + x_xhi: np.ndarray, # extreme high (max) + y: np.ndarray, + times: np.ndarray, # decimal year for colour + xlabel: str, + ylabel: str, + title: str, + corr_text: str, + cmap: str = "plasma", + show_x_bands: bool = True, +) -> None: + """ + Scatter plot with: + - Points coloured by time (decimal year) + - p5–p95 band as horizontal error bars (light) + - min–max band as even lighter error bars + - LOWESS trend line + - Correlation annotation + """ + mask = np.isfinite(x_med) & np.isfinite(y) + xm, ym, tm = x_med[mask], y[mask], times[mask] + + if len(tm) == 0: + ax.set_title(title + "\n(no data)", fontsize=9) + ax.text(0.5, 0.5, "No data", transform=ax.transAxes, + ha="center", va="center", fontsize=10, color="gray") + return + + norm = mcolors.Normalize(vmin=tm.min(), vmax=tm.max()) + sc = ax.scatter(xm, ym, c=tm, cmap=cmap, norm=norm, s=8, alpha=0.65, zorder=3) + + # Uncertainty bands on x (station percentile spread) + if show_x_bands and x_lo is not None and x_hi is not None: + xlo_m = x_lo[mask] + xhi_m = x_hi[mask] + xerr_lo = np.clip(xm - xlo_m, 0, None) + xerr_hi = np.clip(xhi_m - xm, 0, None) + ax.errorbar( + xm, ym, + xerr=[xerr_lo, xerr_hi], + fmt="none", ecolor="steelblue", alpha=0.12, linewidth=0.5, zorder=2, + ) + if show_x_bands and x_xlo is not None and x_xhi is not None: + xxlo_m = x_xlo[mask] + xxhi_m = x_xhi[mask] + xerr_lo2 = np.clip(xm - xxlo_m, 0, None) + xerr_hi2 = np.clip(xxhi_m - xm, 0, None) + ax.errorbar( + xm, ym, + xerr=[xerr_lo2, xerr_hi2], + fmt="none", ecolor="steelblue", alpha=0.06, linewidth=0.4, zorder=1, + ) + + # LOWESS trend + xl, yl = _lowess_line(xm, ym) + if len(xl): + ax.plot(xl, yl, "k-", linewidth=1.5, zorder=4, label="LOWESS") + + ax.set_xlabel(xlabel, fontsize=8) + ax.set_ylabel(ylabel, fontsize=8) + ax.set_title(title, fontsize=9, fontweight="bold") + ax.tick_params(labelsize=7) + ax.text( + 0.03, 0.97, corr_text, + transform=ax.transAxes, fontsize=7, + va="top", ha="left", + bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.8), + ) + + # Colourbar + cbar = plt.colorbar(sc, ax=ax, shrink=0.55, pad=0.02) + cbar.set_label("Year", fontsize=7) + cbar.ax.tick_params(labelsize=6) + + +def _make_figure( + window_label: str, + cr_bins: pd.DataFrame, + seis_log_e: pd.Series, + sun_bins: pd.DataFrame, + stats_list: list[dict], +) -> plt.Figure: + """ + Three-panel figure for one time window: + Panel 1: CR (p50) vs log10(Seismic energy) + Panel 2: CR (p50) vs Sunspot (smoothed) + Panel 3: Sunspot (smoothed) vs log10(Seismic energy) + """ + fig, axes = plt.subplots(1, 3, figsize=(14, 4.5)) + fig.suptitle( + f"Raw pairwise correlations — {window_label} (no detrending)", + fontsize=11, fontweight="bold", y=1.01, + ) + + # Align all series to same index + idx = cr_bins.index + cr_p50 = cr_bins["cr_p50"].reindex(idx).values + cr_p05 = cr_bins["cr_p05"].reindex(idx).values + cr_p95 = cr_bins["cr_p95"].reindex(idx).values + cr_min = cr_bins["cr_min"].reindex(idx).values + cr_max = cr_bins["cr_max"].reindex(idx).values + seis = seis_log_e.reindex(idx).values + sn_sm = sun_bins["sn_smooth"].reindex(idx).values + sn_raw = sun_bins["sn_mean"].reindex(idx).values + sn_min = sun_bins["sn_min"].reindex(idx).values + sn_max = sun_bins["sn_max"].reindex(idx).values + times = _decimal_year(idx) + + def _fmt_stat(d: dict | None, key_r: str, key_rho: str) -> str: + if d is None: + return "" + r = d.get(key_r, np.nan) + rho = d.get(key_rho, np.nan) + pp = d.get("pearson_p", np.nan) + sp = d.get("spearman_p", np.nan) + def _pstr(p): + if not np.isfinite(p): + return "—" + if p < 0.001: + return "p<0.001" + return f"p={p:.3f}" + rs = f"r={r:.3f}" if np.isfinite(r) else "r=—" + rhos = f"ρ={rho:.3f}" if np.isfinite(rho) else "ρ=—" + return f"{rs} {_pstr(pp)}\n{rhos} {_pstr(sp)}" + + # Find stat records for this window + def _find(label: str) -> dict | None: + for d in stats_list: + if d["window"] == window_label and d["label"] == label: + return d + return None + + # Panel 1: CR vs Seismic + ax = axes[0] + _scatter_panel( + ax, cr_p50, cr_p05, cr_p95, cr_min, cr_max, seis, times, + xlabel="CR index (station median, norm.)", + ylabel="log₁₀(Seismic energy)", + title="CR vs Seismicity", + corr_text=_fmt_stat(_find("CR_p50 vs Seismic"), "pearson_r", "spearman_rho"), + show_x_bands=True, + ) + + # Panel 2: CR vs Sunspot + ax = axes[1] + _scatter_panel( + ax, cr_p50, cr_p05, cr_p95, cr_min, cr_max, sn_sm, times, + xlabel="CR index (station median, norm.)", + ylabel="Sunspot number (smoothed)", + title="CR vs Sunspot Number", + corr_text=_fmt_stat(_find("CR_p50 vs Sunspot"), "pearson_r", "spearman_rho"), + show_x_bands=True, + ) + + # Panel 3: Sunspot vs Seismic (sunspot on x with daily spread as error) + ax = axes[2] + _scatter_panel( + ax, sn_sm, sn_min, sn_max, None, None, seis, times, + xlabel="Sunspot number (365d smoothed)", + ylabel="log₁₀(Seismic energy)", + title="Sunspot Number vs Seismicity", + corr_text=_fmt_stat(_find("Sunspot vs Seismic"), "pearson_r", "spearman_rho"), + show_x_bands=True, + ) + # Add raw sunspot spread as lighter error bars + mask3 = np.isfinite(sn_sm) & np.isfinite(seis) + ax.errorbar( + sn_sm[mask3], seis[mask3], + xerr=[ + np.clip(sn_sm[mask3] - sn_min[mask3], 0, None), + np.clip(sn_max[mask3] - sn_sm[mask3], 0, None), + ], + fmt="none", ecolor="orange", alpha=0.08, linewidth=0.4, zorder=1, + ) + + fig.tight_layout() + return fig + + +# --------------------------------------------------------------------------- +# Main +# --------------------------------------------------------------------------- + +def main() -> None: + FIG_DIR.mkdir(parents=True, exist_ok=True) + OUT_DIR.mkdir(parents=True, exist_ok=True) + + # ── 1. Download missing OOS data ──────────────────────────────────────── + log.info("Checking OOS USGS data …") + _ensure_usgs(range(2020, 2026)) + + log.info("Checking OOS NMDB data …") + oos_stations = _oos_stations() + log.info("OOS stations to download: %d", len(oos_stations)) + _ensure_nmdb_oos(oos_stations, range(2020, 2026)) + + # ── 2. Define windows ─────────────────────────────────────────────────── + windows = { + "In-sample (1976–2019)": (IN_SAMPLE_START, IN_SAMPLE_END), + "OOS (2020–2025)": (OOS_START, OOS_END), + "Combined (1976–2025)": (COMBINED_START, COMBINED_END), + } + + # ── 3. Compute correlations ───────────────────────────────────────────── + n_tests = 9 # 3 pairs × 3 windows, Bonferroni denominator + + all_stats: list[dict] = [] + window_data: dict[str, tuple] = {} + + for win_label, (wstart, wend) in windows.items(): + log.info("=== Window: %s ===", win_label) + + cr_bins = _load_nmdb_bins(wstart, wend) + seis_log_e = _load_seismic_energy(wstart, wend) + sun_bins = _load_sunspot_bins(wstart, wend) + + if cr_bins.empty: + log.warning("No CR data for %s — skipping", win_label) + continue + + window_data[win_label] = (cr_bins, seis_log_e, sun_bins) + + # Align to common index + idx = cr_bins.index + cr_p50 = cr_bins["cr_p50"].reindex(idx).values + cr_p95 = cr_bins["cr_p95"].reindex(idx).values + seis = seis_log_e.reindex(idx).values + sn = sun_bins["sn_smooth"].reindex(idx).values + + pairs = [ + ("CR_p50 vs Seismic", cr_p50, seis), + ("CR_p95 vs Seismic", cr_p95, seis), + ("CR_p50 vs Sunspot", cr_p50, sn), + ("CR_p95 vs Sunspot", cr_p95, sn), + ("Sunspot vs Seismic", sn, seis), + ] + + for label, x, y in pairs: + rec = _correlate_pair(x, y, label, win_label, n_tests) + all_stats.append(rec) + log.info( + " %-30s r=% .3f (p=%.3g) ρ=% .3f (p=%.3g) n=%d", + label, + rec["pearson_r"], rec["pearson_p"], + rec["spearman_rho"], rec["spearman_p"], + rec["n_bins"], + ) + + # ── 4. Save JSON ───────────────────────────────────────────────────────── + out_json = OUT_DIR / "raw_pairwise_correlations.json" + + def _nan_to_none(obj): + if isinstance(obj, float) and np.isnan(obj): + return None + if isinstance(obj, dict): + return {k: _nan_to_none(v) for k, v in obj.items()} + if isinstance(obj, list): + return [_nan_to_none(v) for v in obj] + return obj + + with open(out_json, "w") as fh: + json.dump(_nan_to_none({"n_tests_bonferroni": n_tests, "results": all_stats}), fh, indent=2) + log.info("Saved %s", out_json) + + # ── 5. Print LaTeX table ───────────────────────────────────────────────── + _print_latex_table(all_stats) + + # ── 6. Produce figures ─────────────────────────────────────────────────── + fig_names = { + "In-sample (1976–2019)": "raw_corr_insample.png", + "OOS (2020–2025)": "raw_corr_oos.png", + "Combined (1976–2025)": "raw_corr_combined.png", + } + + for win_label, (cr_bins, seis_log_e, sun_bins) in window_data.items(): + fig = _make_figure(win_label, cr_bins, seis_log_e, sun_bins, all_stats) + fname = FIG_DIR / fig_names[win_label] + fig.savefig(fname, dpi=150, bbox_inches="tight") + plt.close(fig) + log.info("Saved %s", fname) + + log.info("Done.") + + +# --------------------------------------------------------------------------- +# LaTeX table helper +# --------------------------------------------------------------------------- + +def _print_latex_table(stats: list[dict]) -> None: + """Print a LaTeX longtable fragment to stdout.""" + # Primary 9 pairs (CR_p50 + Sunspot vs Seismic) + primary_labels = ["CR_p50 vs Seismic", "CR_p50 vs Sunspot", "Sunspot vs Seismic"] + window_order = [ + "In-sample (1976–2019)", + "OOS (2020–2025)", + "Combined (1976–2025)", + ] + + def _lookup(label, window): + for d in stats: + if d["label"] == label and d["window"] == window: + return d + return {} + + def _rf(v, fmt=".3f"): + if v is None or (isinstance(v, float) and np.isnan(v)): + return "—" + return format(v, fmt) + + def _pstar(p_bonf): + if p_bonf is None or (isinstance(p_bonf, float) and np.isnan(p_bonf)): + return "" + if p_bonf < 0.001: + return "$^{***}$" + if p_bonf < 0.01: + return "$^{**}$" + if p_bonf < 0.05: + return "$^{*}$" + return "" + + # Map labels to display names + label_display = { + "CR_p50 vs Seismic": r"CR (med.) vs Seismicity", + "CR_p50 vs Sunspot": r"CR (med.) vs Sunspot", + "Sunspot vs Seismic": r"Sunspot vs Seismicity", + } + win_display = { + "In-sample (1976–2019)": r"In-sample (1976--2019)", + "OOS (2020–2025)": r"OOS (2020--2025)", + "Combined (1976–2025)": r"Combined (1976--2025)", + } + + lines = [] + lines.append(r"""% Auto-generated by 09_raw_pairwise_correlations.py +\begin{table}[htbp] + \centering + \caption{Raw pairwise correlation statistics across three time windows. + Bonferroni correction applied for $3 \times 3 = 9$ tests. + CR uses the per-bin station-median index. + Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$. + Sunspot is the 365-day smoothed daily count. + $^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$, + $^{***}p_\text{Bonf}<0.001$.} + \label{tab:rawcorr} + \setlength{\tabcolsep}{4pt} + \begin{tabular}{llrrrrrrr} + \toprule + Pair & Window & $N$ & + $r$ & 95\% CI & + $p$ (raw) & $p$ (Bonf.) & + $\rho$ & $p_\rho$ (Bonf.) \\ + \midrule""") + + for lbl in primary_labels: + disp = label_display[lbl] + first = True + for win in window_order: + d = _lookup(lbl, win) + r = d.get("pearson_r") + ci_lo = d.get("pearson_ci_lo") + ci_hi = d.get("pearson_ci_hi") + pp = d.get("pearson_p") + pp_b = d.get("pearson_p_bonf") + rho = d.get("spearman_rho") + sp_b = d.get("spearman_p_bonf") + n = d.get("n_bins", 0) + star = _pstar(pp_b) + rho_star = _pstar(sp_b) + + ci_str = f"[{_rf(ci_lo)}, {_rf(ci_hi)}]" if ci_lo is not None else "—" + row_lbl = disp if first else "" + first = False + + lines.append( + f" {row_lbl} & {win_display.get(win, win)} & {n} & " + f"{_rf(r)}{star} & {ci_str} & " + f"{_rf(pp, '.3g')} & {_rf(pp_b, '.3g')} & " + f"{_rf(rho)}{rho_star} & {_rf(sp_b, '.3g')} \\\\" + ) + lines.append(r" \addlinespace") + + lines.append(r""" \bottomrule + \end{tabular} + \bigskip + + \textit{Note: CR\textsubscript{p95} variant (station 95th percentile instead of median) + gives similar structure; see \texttt{results/raw\_pairwise\_correlations.json}.} +\end{table}""") + + table_text = "\n".join(lines) + print(table_text) + # Also save to file + out = OUT_DIR / "raw_pairwise_table.tex" + out.write_text(table_text + "\n", encoding="utf-8") + log.info("LaTeX table written to %s", out) + + +if __name__ == "__main__": + main()