Add raw pairwise correlation analysis (script 09) and paper section

New script 09_raw_pairwise_correlations.py downloads OOS NMDB/USGS data
and computes Pearson r and Spearman ρ (with Bonferroni correction) for
all three variable pairs across in-sample, OOS, and combined windows.
CR flux is represented by its per-bin station distribution (p5–p95 band
with min–max overlay); seismic energy uses the physically correct
E ∝ 10^(1.5·Mw) sum; sunspots shown with 365-day smoothed + raw spread.

Key findings: CR vs sunspot r=-0.82 to -0.94 (Forbush decrease); CR vs
seismicity r=0.057 raw (OOS: r=0.046, not significant); confounding
triangle motivates HP-filter detrending analysis.

Paper gains a new Section 4.1 "Raw Pairwise Correlations" with three
scatter figures and a 9-test Bonferroni summary table; 24 pages total.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
commit 817d7ba042
parent f99b2fc04e
Author: root
Date: 2026-04-24 13:45:50 +02:00
11 changed files with 1302 additions and 1 deletions

Binary file not shown (new file, 238 KiB).

Binary file not shown (new file, 230 KiB).

paper/figs/raw_corr_oos.png (new binary file, 180 KiB; content not shown).

Binary file not shown.

View file

@@ -395,6 +395,209 @@ $(P, \varphi)$ to avoid local minima.
\section{Results}
\label{sec:results}
\subsection{Raw Pairwise Correlations Between CR, Seismic, and Sunspot Data}
\label{sec:res:rawcorr}
Before applying any detrending, we characterise the raw statistical relationships
between all three observable time series across the three analysis windows.
This serves as the pre-detrending baseline that motivates the Hodrick--Prescott
filter analysis described in Section~\ref{sec:res:detrend}.
\subsubsection{CR index: station distribution}
The global CR index is not collapsed to a single mean.
Instead, for each 5-day bin $t$ we compute the distribution of normalised
count-rate values across all $n_t \geq 3$ contributing stations,
yielding five order statistics:
$\hat{x}_{t}^{(q)} = q\text{-th percentile}$ for $q \in \{5, 25, 50, 75, 95\}$,
together with the inter-station minimum and maximum.
All stations are normalised by their individual long-run mean before the
cross-station percentile is taken, so the station-median series has a grand
mean near unity.
The scatter panels in Figures~\ref{fig:rawinsample}--\ref{fig:rawcombined}
use $\hat{x}_{t}^{(50)}$ as the central CR value (horizontal axis) and display
the $[\hat{x}_{t}^{(5)}, \hat{x}_{t}^{(95)}]$ inter-station spread as
horizontal error bars, with the $[\hat{x}_{t}^{\min}, \hat{x}_{t}^{\max}]$
range overlaid in a lighter shade.
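The station aggregation described above can be sketched with a small synthetic example (the station names and count rates below are illustrative, not the paper's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=30, freq="D")

# Three hypothetical stations with different absolute count rates.
stations = {
    "STN_A": pd.Series(100 + rng.normal(0, 2, 30), index=idx),
    "STN_B": pd.Series(250 + rng.normal(0, 5, 30), index=idx),
    "STN_C": pd.Series(180 + rng.normal(0, 4, 30), index=idx),
}

# Normalise each station by its long-run mean, then average into 5-day bins.
norm = pd.DataFrame({s: v / v.mean() for s, v in stations.items()})
binned = norm.resample("5D").mean()  # rows = bins, columns = stations

# Cross-station order statistics per bin (the p5/p50/p95 band).
cr = pd.DataFrame({
    "cr_p05": binned.quantile(0.05, axis=1),
    "cr_p50": binned.quantile(0.50, axis=1),
    "cr_p95": binned.quantile(0.95, axis=1),
    "cr_n": binned.notna().sum(axis=1),
})
print(cr.round(3))
```

Because each station is scaled by its own mean first, the median series hovers near unity regardless of the stations' absolute count rates.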
\subsubsection{Seismic energy metric}
The seismic activity per bin is measured by the total released seismic energy,
computed as the sum of the individual earthquake energies:
\begin{equation}
E_t = \sum_{i \in \mathcal{B}_t} 10^{1.5\, M_{W,i}},
\label{eq:energy}
\end{equation}
where $\mathcal{B}_t$ is the set of $M_W \geq 4.5$ events falling in bin $t$.
Working with $\log_{10} E_t$ removes the extreme skewness of $E_t$ and is
physically preferable to summing $M_W$ directly, which is dimensionally
inconsistent because magnitude is already a logarithmic quantity.
The $\log_{10} E_t$ axis in the scatter panels spans roughly three orders of
magnitude, reflecting the heavy-tailed Gutenberg--Richter distribution of
earthquake sizes.
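In code, the energy sum above reduces to a one-liner (the magnitudes below are made up for illustration):

```python
import numpy as np

# Hypothetical Mw >= 4.5 events falling in one 5-day bin.
mw = np.array([4.6, 5.1, 6.8])

# E_t = sum_i 10^(1.5 * Mw_i); working with log10(E_t) tames the heavy tail.
log_energy = np.log10(np.sum(10.0 ** (1.5 * mw)))
print(round(log_energy, 4))
```

The largest event dominates the sum: the Mw 6.8 shock alone contributes $10^{10.2}$, roughly 2000 times the Mw 4.6 event's $10^{6.9}$, which is why summing magnitudes directly would misrepresent the released energy.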
\subsubsection{Sunspot number}
Daily sunspot numbers from SILSO are smoothed with a 365-day rolling mean to
suppress intra-year variability and isolate the solar-cycle envelope.
Each 5-day bin carries both the smoothed value (scatter-plot axis) and the
daily min--max spread within that bin (shown as additional horizontal error
bars on the sunspot axis in Figures~\ref{fig:rawinsample}--\ref{fig:rawcombined},
panel 3).
\subsubsection{Correlation results}
We compute both Pearson $r$ (linear; Fisher $z$-transform 95\% CI) and
Spearman $\rho$ (rank-based; appropriate for the heavy-tailed marginal
distributions of $E_t$ and CR flux).
Bonferroni correction for $3\ \text{pairs} \times 3\ \text{windows} = 9$
tests is applied; star levels refer to the corrected $p$-values.
Full results are given in Table~\ref{tab:rawcorr}.
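The test battery just described can be reproduced in a few lines (a standalone sketch with simulated data, not the repository's exact implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = -0.8 * x + rng.normal(scale=0.6, size=500)  # built-in anti-correlation

# Pearson r with a Fisher z-transform 95% confidence interval.
r, p = stats.pearsonr(x, y)
z, se = np.arctanh(r), 1.0 / np.sqrt(len(x) - 3)
ci_lo, ci_hi = np.tanh([z - 1.96 * se, z + 1.96 * se])

# Rank-based Spearman rho (robust to heavy-tailed marginals).
rho, p_s = stats.spearmanr(x, y)

# Bonferroni correction for the 3 pairs x 3 windows = 9 tests.
p_bonf = min(1.0, 9 * p)
print(round(r, 3), round(ci_lo, 3), round(ci_hi, 3), round(rho, 3))
```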
\paragraph{CR vs.\ seismicity.}
The raw Pearson correlation between the station-median CR index and
$\log_{10} E_t$ is $r = 0.057$ in the in-sample window ($N = 3214$ bins,
$p_\text{Bonf} = 0.011$, $\rho = 0.071$, $p_\text{Bonf} < 10^{-3}$) and
$r = 0.046$ in the OOS window ($N = 390$, $p_\text{Bonf} = 1.0$).
The correlation is thus only marginally significant after correction and
disappears entirely in the independent OOS window.
Using the station-95th-percentile CR series in place of the median
amplifies the correlation slightly ($r = 0.069$ in-sample), suggesting that
high-flux excursions drive the signal.
\paragraph{CR vs.\ sunspot number.}
The dominant raw relationship in the dataset is a strong anti-correlation
between the CR index and the smoothed sunspot number: $r = -0.820$
($\rho = -0.854$) in-sample, and $r = -0.939$ ($\rho = -0.951$) in the OOS
window.
This is the well-established \emph{Forbush decrease} mechanism: a higher
heliospheric magnetic flux during solar maximum deflects more galactic cosmic
rays before they reach the Earth's atmosphere
\citep{Potgieter2013}, so CR flux and solar activity are naturally anti-phase.
The stronger OOS value ($-0.94$ vs.\ $-0.82$) reflects the wide dynamic range
of Solar Cycle~25 (2020--2025), which reached high activity levels after the
deep minimum of 2019--2020.
\paragraph{Sunspot vs.\ seismicity.}
The raw correlation between the smoothed sunspot number and seismic energy is
negative but small: $r = -0.095$ ($\rho = -0.099$) in-sample
($p_\text{Bonf} \approx 10^{-6}$) and indistinguishable from zero in the OOS window
($r = -0.023$, $p_\text{Bonf} = 1.0$).
The negative sign is consistent with reports of a weak inverse relationship
between solar activity and global seismicity
\citep{Odintsov2006}, though the OOS failure indicates this effect is not
robust.
\subsubsection{Interpretation: a confounding triangle}
The three raw correlations form a confounding triangle.
The CR--sunspot anti-correlation is strong ($r \approx -0.82$ to $-0.94$)
and physically understood.
The sunspot--seismicity anti-correlation is weak but significant in-sample
($r \approx -0.095$).
Together these imply a spurious \emph{positive} CR--seismicity correlation:
both CR and seismicity are jointly modulated by the $\sim$11-year solar cycle
--- CR increases during solar minima, and seismicity may very weakly increase
during solar minima --- so the two series covary positively without any direct
physical connection.
This confound cannot be resolved by naive cross-correlation; it requires
explicitly removing the shared solar-cycle component.
The HP filter analysis (Section~\ref{sec:res:detrend}) accomplishes exactly
this, and shows that the apparent CR--seismicity signal vanishes once the
solar-cycle trend is eliminated.
These raw correlations are deliberately left uncorrected for solar-cycle
modulation.
They are presented here as the pre-detrending baseline to document the scale
of the confound before any correction is applied, and to highlight that even
the uncorrected CR--seismicity correlation ($r = 0.057$) is far weaker than
the CR--sunspot anti-correlation ($r = -0.82$) that drives it.
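The confounding triangle is easy to reproduce with a toy simulation (all coefficients below are illustrative): two series that are each anti-phase with a shared cycle end up positively correlated with one another even though neither drives the other.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(3600)                  # roughly 50 years of 5-day bins
cycle = np.sin(2 * np.pi * t / 803)  # ~11-year cycle (803 bins)

# CR is strongly anti-phase with the cycle (Forbush-style); "seismicity"
# is only weakly anti-phase, with no direct CR term at all.
cr = -1.0 * cycle + rng.normal(scale=0.5, size=t.size)
seis = -0.15 * cycle + rng.normal(scale=1.0, size=t.size)

# The shared driver alone induces a small positive CR-seismicity correlation.
r_spurious = np.corrcoef(cr, seis)[0, 1]
print(round(r_spurious, 3))
```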
\begin{table}[htbp]
\centering
\caption{Raw pairwise correlation statistics across three time windows.
Bonferroni correction applied for $3 \times 3 = 9$ tests.
CR uses the per-bin station-median index ($\hat{x}^{(50)}$).
Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$.
Sunspot is the 365-day smoothed daily count.
$^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$,
$^{***}p_\text{Bonf}<0.001$.}
\label{tab:rawcorr}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{llrrrrrrr}
\toprule
Pair & Window & $N$ &
$r$ & 95\% CI &
$p$ (raw) & $p$ (Bonf.) &
$\rho$ & $p_\rho$ (Bonf.) \\
\midrule
CR vs Seismicity & In-sample & 3214 & $+0.057^{*}$ & $[+0.023, +0.092]$ & $0.0012$ & $0.011$ & $+0.071^{***}$ & $<10^{-3}$ \\
& OOS & 390 & $+0.046$ & $[-0.053, +0.145]$ & $0.362$ & $1.00$ & $+0.018$ & $1.00$ \\
& Combined & 3604 & $+0.055^{**}$ & $[+0.023, +0.088]$ & $0.0009$ & $0.008$ & $+0.065^{***}$ & $<10^{-3}$ \\
\addlinespace
CR vs Sunspot & In-sample & 3109 & $-0.820^{***}$ & $[-0.831, -0.808]$ & $\approx 0$ & $\approx 0$ & $-0.854^{***}$ & $\approx 0$ \\
& OOS & 385 & $-0.939^{***}$ & $[-0.950, -0.926]$ & $\approx 0$ & $\approx 0$ & $-0.951^{***}$ & $\approx 0$ \\
& Combined & 3494 & $-0.815^{***}$ & $[-0.826, -0.804]$ & $\approx 0$ & $\approx 0$ & $-0.844^{***}$ & $\approx 0$ \\
\addlinespace
Sunspot vs Seismicity & In-sample & 3109 & $-0.095^{***}$ & $[-0.130, -0.060]$ & $1.1 \times 10^{-7}$ & $10^{-6}$ & $-0.099^{***}$ & $<10^{-6}$ \\
& OOS & 385 & $-0.023$ & $[-0.123, +0.077]$ & $0.648$ & $1.00$ & $-0.016$ & $1.00$ \\
& Combined & 3494 & $-0.086^{***}$ & $[-0.119, -0.053]$ & $3.4 \times 10^{-7}$ & $3 \times 10^{-6}$ & $-0.086^{***}$ & $<10^{-6}$ \\
\addlinespace
\bottomrule
\end{tabular}
\medskip
\raggedright
\textit{Note:} CR$_{p95}$ variant (station 95th-percentile instead of median)
gives qualitatively identical structure; full results in
\texttt{results/raw\_pairwise\_correlations.json}.
\end{table}
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{raw_corr_insample.png}
\caption{Raw pairwise scatter plots for the in-sample window (1976--2019,
$N = 3{,}215$ five-day bins).
\textbf{Left}: CR station-median index vs $\log_{10}$ seismic energy; horizontal
error bars span the station $[\hat{x}^{(5)}, \hat{x}^{(95)}]$ spread.
\textbf{Centre}: CR index vs 365-day smoothed sunspot number; the strong
anti-correlation ($r = -0.82$) reflects the Forbush decrease mechanism.
\textbf{Right}: Smoothed sunspot number vs $\log_{10}$ seismic energy; thin
horizontal error bars show the daily sunspot spread within each 5-day bin.
All points are coloured by decimal year (plasma colormap), revealing the
solar-cycle drift: successive cycles trace the same anti-correlated arc in
the centre panel.
Black curves are LOWESS trend lines ($f = 0.4$).}
\label{fig:rawinsample}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{raw_corr_oos.png}
\caption{Raw pairwise scatter plots for the out-of-sample window
(2020--2025, $N = 390$ bins, 27 NMDB stations).
Layout identical to Figure~\ref{fig:rawinsample}.
The CR--sunspot anti-correlation strengthens to $r = -0.939$ during Solar
Cycle~25, which had a particularly wide dynamic range.
The CR--seismicity correlation ($r = 0.046$, $p_\text{Bonf} = 1.0$) is
indistinguishable from zero.}
\label{fig:rawoos}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\textwidth]{raw_corr_combined.png}
\caption{Raw pairwise scatter plots for the full combined window
(1976--2025, $N = 3{,}604$ bins).
The multi-decadal colour gradient in the centre panel shows CR and sunspot
oscillating in anti-phase across four complete solar cycles.
The overall CR--seismicity Pearson correlation ($r = 0.055$, $\rho = 0.065$)
is statistically significant ($p_\text{Bonf} = 0.008$) but quantitatively
negligible, and its sign follows directly from the confounding triangle
described in the text.}
\label{fig:rawcombined}
\end{figure}
\subsection{In-Sample Replication (1976--2019)}
\label{sec:res:insample}
@@ -661,7 +864,7 @@ at $\tau = -525$~days.
Even setting aside the statistical issues, the proposed mechanism faces severe
physical constraints.
The total ionisation dose from galactic CRs at the surface is
-$\sim$\SI{0.3}{\milli\gray\per\year} \citep{Aplin2005} --- far too small to
+$\sim 0.3$~mGy/year \citep{Aplin2005} --- far too small to
transfer meaningful mechanical energy to fault zones, which require shear-stress
changes of order $\sim$0.01--1~MPa to trigger earthquakes.
Proposed mechanisms via radon ionisation \citep{Pulinets2004} or nuclear

Binary file not shown (new file, 238 KiB).

Binary file not shown (new file, 230 KiB).

Binary file not shown (new file, 180 KiB).

View file

@@ -0,0 +1,200 @@
{
"n_tests_bonferroni": 9,
"results": [
{
"label": "CR_p50 vs Seismic",
"window": "In-sample (1976\u20132019)",
"n_bins": 3214,
"pearson_r": 0.05710710378868533,
"pearson_p": 0.00119988073256791,
"pearson_ci_lo": 0.02257726973505399,
"pearson_ci_hi": 0.09150085205340468,
"pearson_p_bonf": 0.010798926593111189,
"spearman_rho": 0.07097861545637706,
"spearman_p": 5.63861918381612e-05,
"spearman_p_bonf": 0.0005074757265434508
},
{
"label": "CR_p95 vs Seismic",
"window": "In-sample (1976\u20132019)",
"n_bins": 3214,
"pearson_r": 0.06907963019575922,
"pearson_p": 8.875791465767054e-05,
"pearson_ci_lo": 0.03458782771353293,
"pearson_ci_hi": 0.1034070656655567,
"pearson_p_bonf": 0.0007988212319190349,
"spearman_rho": 0.0772244192684692,
"spearman_p": 1.1713062375821977e-05,
"spearman_p_bonf": 0.0001054175613823978
},
{
"label": "CR_p50 vs Sunspot",
"window": "In-sample (1976\u20132019)",
"n_bins": 3109,
"pearson_r": -0.8196118840402077,
"pearson_p": 0.0,
"pearson_ci_lo": -0.8308273920487026,
"pearson_ci_hi": -0.8077309154215488,
"pearson_p_bonf": 0.0,
"spearman_rho": -0.8537788002205909,
"spearman_p": 0.0,
"spearman_p_bonf": 0.0
},
{
"label": "CR_p95 vs Sunspot",
"window": "In-sample (1976\u20132019)",
"n_bins": 3109,
"pearson_r": -0.7111385470496709,
"pearson_p": 0.0,
"pearson_ci_lo": -0.7280904984781218,
"pearson_ci_hi": -0.6933173005311596,
"pearson_p_bonf": 0.0,
"spearman_rho": -0.7120845180015138,
"spearman_p": 0.0,
"spearman_p_bonf": 0.0
},
{
"label": "Sunspot vs Seismic",
"window": "In-sample (1976\u20132019)",
"n_bins": 3109,
"pearson_r": -0.09493087002363143,
"pearson_p": 1.1388681609654831e-07,
"pearson_ci_lo": -0.12965168680195338,
"pearson_ci_hi": -0.059977540109432424,
"pearson_p_bonf": 1.0249813448689347e-06,
"spearman_rho": -0.09896108460051863,
"spearman_p": 3.215656878418318e-08,
"spearman_p_bonf": 2.894091190576486e-07
},
{
"label": "CR_p50 vs Seismic",
"window": "OOS (2020\u20132025)",
"n_bins": 390,
"pearson_r": 0.046285662686090766,
"pearson_p": 0.3619670906211286,
"pearson_ci_lo": -0.053261360369158024,
"pearson_ci_hi": 0.14492178269732792,
"pearson_p_bonf": 1.0,
"spearman_rho": 0.018431820974901467,
"spearman_p": 0.7167081698328611,
"spearman_p_bonf": 1.0
},
{
"label": "CR_p95 vs Seismic",
"window": "OOS (2020\u20132025)",
"n_bins": 390,
"pearson_r": 0.01741112912354181,
"pearson_p": 0.731775599256422,
"pearson_ci_lo": -0.08203292234056903,
"pearson_ci_hi": 0.1165119037773267,
"pearson_p_bonf": 1.0,
"spearman_rho": 0.005527877339000357,
"spearman_p": 0.913347440746341,
"spearman_p_bonf": 1.0
},
{
"label": "CR_p50 vs Sunspot",
"window": "OOS (2020\u20132025)",
"n_bins": 385,
"pearson_r": -0.9386922203247808,
"pearson_p": 3.195693957272318e-179,
"pearson_ci_lo": -0.9495525580837112,
"pearson_ci_hi": -0.925583111407333,
"pearson_p_bonf": 2.876124561545086e-178,
"spearman_rho": -0.9509485215521011,
"spearman_p": 2.976672353983517e-197,
"spearman_p_bonf": 2.679005118585165e-196
},
{
"label": "CR_p95 vs Sunspot",
"window": "OOS (2020\u20132025)",
"n_bins": 385,
"pearson_r": -0.8009259368888886,
"pearson_p": 2.4808686858547665e-87,
"pearson_ci_lo": -0.8341024755606444,
"pearson_ci_hi": -0.7619757168998428,
"pearson_p_bonf": 2.2327818172692897e-86,
"spearman_rho": -0.8577912916660049,
"spearman_p": 9.430786501080505e-113,
"spearman_p_bonf": 8.487707850972454e-112
},
{
"label": "Sunspot vs Seismic",
"window": "OOS (2020\u20132025)",
"n_bins": 385,
"pearson_r": -0.023356101158924312,
"pearson_p": 0.6477774446611007,
"pearson_ci_lo": -0.12301462492345265,
"pearson_ci_hi": 0.07676878530250154,
"pearson_p_bonf": 1.0,
"spearman_rho": -0.016183406260000095,
"spearman_p": 0.7516005849333192,
"spearman_p_bonf": 1.0
},
{
"label": "CR_p50 vs Seismic",
"window": "Combined (1976\u20132025)",
"n_bins": 3604,
"pearson_r": 0.055280515666118794,
"pearson_p": 0.0008999788202432366,
"pearson_ci_lo": 0.022671514404173917,
"pearson_ci_hi": 0.08777201673979473,
"pearson_p_bonf": 0.008099809382189129,
"spearman_rho": 0.06450971045346121,
"spearman_p": 0.00010643839785029366,
"spearman_p_bonf": 0.0009579455806526429
},
{
"label": "CR_p95 vs Seismic",
"window": "Combined (1976\u20132025)",
"n_bins": 3604,
"pearson_r": 0.06534669803613433,
"pearson_p": 8.643658108903141e-05,
"pearson_ci_lo": 0.03276668669712753,
"pearson_ci_hi": 0.09778798224460038,
"pearson_p_bonf": 0.0007779292298012827,
"spearman_rho": 0.07455587634565956,
"spearman_p": 7.446103364283088e-06,
"spearman_p_bonf": 6.701493027854779e-05
},
{
"label": "CR_p50 vs Sunspot",
"window": "Combined (1976\u20132025)",
"n_bins": 3494,
"pearson_r": -0.8151123489946459,
"pearson_p": 0.0,
"pearson_ci_lo": -0.8259476899624344,
"pearson_ci_hi": -0.8036749974749445,
"pearson_p_bonf": 0.0,
"spearman_rho": -0.8441554235717627,
"spearman_p": 0.0,
"spearman_p_bonf": 0.0
},
{
"label": "CR_p95 vs Sunspot",
"window": "Combined (1976\u20132025)",
"n_bins": 3494,
"pearson_r": -0.7288431646136924,
"pearson_p": 0.0,
"pearson_ci_lo": -0.7440213093422423,
"pearson_ci_hi": -0.7129131861565221,
"pearson_p_bonf": 0.0,
"spearman_rho": -0.7424462187819917,
"spearman_p": 0.0,
"spearman_p_bonf": 0.0
},
{
"label": "Sunspot vs Seismic",
"window": "Combined (1976\u20132025)",
"n_bins": 3494,
"pearson_r": -0.08618292707351498,
"pearson_p": 3.3617107739195996e-07,
"pearson_ci_lo": -0.11900279156634344,
"pearson_ci_hi": -0.05317493830432684,
"pearson_p_bonf": 3.0255396965276398e-06,
"spearman_rho": -0.08626578684354677,
"spearman_p": 3.275322308474069e-07,
"spearman_p_bonf": 2.9477900776266624e-06
}
]
}
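Downstream scripts can pull individual rows out of this file by (label, window); a minimal sketch using an inline stand-in for the JSON shown above:

```python
import json

# Inline stand-in for results/raw_pairwise_correlations.json (one row shown).
payload = json.loads("""
{
  "n_tests_bonferroni": 9,
  "results": [
    {"label": "CR_p50 vs Seismic", "window": "OOS (2020\\u20132025)",
     "n_bins": 390, "pearson_r": 0.0463, "pearson_p_bonf": 1.0}
  ]
}
""")

# Index rows by (label, window) for direct lookup.
by_key = {(r["label"], r["window"]): r for r in payload["results"]}
row = by_key[("CR_p50 vs Seismic", "OOS (2020\u20132025)")]
print(row["n_bins"], row["pearson_p_bonf"])
```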

View file

@@ -0,0 +1,38 @@
% Auto-generated by 09_raw_pairwise_correlations.py
\begin{table}[htbp]
\centering
\caption{Raw pairwise correlation statistics across three time windows.
Bonferroni correction applied for $3 \times 3 = 9$ tests.
CR uses the per-bin station-median index.
Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$.
Sunspot is the 365-day smoothed daily count.
$^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$,
$^{***}p_\text{Bonf}<0.001$.}
\label{tab:rawcorr}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{llrrrrrrr}
\toprule
Pair & Window & $N$ &
$r$ & 95\% CI &
$p$ (raw) & $p$ (Bonf.) &
$\rho$ & $p_\rho$ (Bonf.) \\
\midrule
CR (med.) vs Seismicity & In-sample (1976--2019) & 3214 & 0.057$^{*}$ & [0.023, 0.092] & 0.0012 & 0.0108 & 0.071$^{***}$ & 0.000507 \\
& OOS (2020--2025) & 390 & 0.046 & [-0.053, 0.145] & 0.362 & 1 & 0.018 & 1 \\
& Combined (1976--2025) & 3604 & 0.055$^{**}$ & [0.023, 0.088] & 0.0009 & 0.0081 & 0.065$^{***}$ & 0.000958 \\
\addlinespace
CR (med.) vs Sunspot & In-sample (1976--2019) & 3109 & -0.820$^{***}$ & [-0.831, -0.808] & 0 & 0 & -0.854$^{***}$ & 0 \\
& OOS (2020--2025) & 385 & -0.939$^{***}$ & [-0.950, -0.926] & 3.2e-179 & 2.88e-178 & -0.951$^{***}$ & 2.68e-196 \\
& Combined (1976--2025) & 3494 & -0.815$^{***}$ & [-0.826, -0.804] & 0 & 0 & -0.844$^{***}$ & 0 \\
\addlinespace
Sunspot vs Seismicity & In-sample (1976--2019) & 3109 & -0.095$^{***}$ & [-0.130, -0.060] & 1.14e-07 & 1.02e-06 & -0.099$^{***}$ & 2.89e-07 \\
& OOS (2020--2025) & 385 & -0.023 & [-0.123, 0.077] & 0.648 & 1 & -0.016 & 1 \\
& Combined (1976--2025) & 3494 & -0.086$^{***}$ & [-0.119, -0.053] & 3.36e-07 & 3.03e-06 & -0.086$^{***}$ & 2.95e-06 \\
\addlinespace
\bottomrule
\end{tabular}
\bigskip
\textit{Note: CR\textsubscript{p95} variant (station 95th percentile instead of median)
gives similar structure; see \texttt{results/raw\_pairwise\_correlations.json}.}
\end{table}

View file

@@ -0,0 +1,860 @@
#!/usr/bin/env python3
"""
09_raw_pairwise_correlations.py
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Raw pairwise correlations between galactic cosmic-ray flux (CR), global
seismicity, and solar activity (sunspot number) across three time windows:
in-sample : 1976-01-01 to 2019-12-31
OOS : 2020-01-01 to 2025-04-29
combined : 1976-01-01 to 2025-04-29
No HP filtering or detrending is applied. Missing OOS data (NMDB 2020-2025,
USGS 2020-2025) are downloaded automatically.
CR is represented by its per-bin distribution across NMDB stations (p5, p50,
p95, min, max). Seismic energy uses the physically correct E ∝ 10^(1.5·Mw)
sum. Two CR variants are correlated: the station-median (p50) and station-p95.
Outputs
-------
results/raw_pairwise_correlations.json
results/figs/raw_corr_insample.png
results/figs/raw_corr_oos.png
results/figs/raw_corr_combined.png
"""
from __future__ import annotations
import json
import logging
import sys
import warnings
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from scipy import stats
from statsmodels.nonparametric.smoothers_lowess import lowess
# ---------------------------------------------------------------------------
# Paths
# ---------------------------------------------------------------------------
ROOT = Path(__file__).resolve().parent.parent
NMDB_DIR = ROOT / "data" / "raw" / "nmdb"
USGS_DIR = ROOT / "data" / "raw" / "usgs"
SIDC_DIR = ROOT / "data" / "raw" / "sidc"
AVAIL_FILE = ROOT / "results" / "data_availability.json"
OUT_DIR = ROOT / "results"
FIG_DIR = ROOT / "results" / "figs"
# Add src/ to path
sys.path.insert(0, str(ROOT / "src"))
from crq.ingest.nmdb import download_station_year, load_station, resample_daily
from crq.ingest.usgs import download_year as download_usgs_year, load_usgs
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
BIN_DAYS = 5
EPOCH = pd.Timestamp("1976-01-01")
IN_SAMPLE_START = "1976-01-01"
IN_SAMPLE_END = "2019-12-31"
OOS_START = "2020-01-01"
OOS_END = "2025-04-29"
COMBINED_START = "1976-01-01"
COMBINED_END = "2025-04-29"
MIN_MAG = 4.5
MIN_STATIONS = 3
COVERAGE_THRESHOLD = 0.60
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(message)s",
datefmt="%H:%M:%S",
)
log = logging.getLogger(__name__)
# ---------------------------------------------------------------------------
# Helper: 5-day bin index
# ---------------------------------------------------------------------------
def _bin_index(ts: pd.Timestamp) -> int:
return (ts - EPOCH).days // BIN_DAYS
def _bin_start(b: int) -> pd.Timestamp:
return EPOCH + pd.Timedelta(days=b * BIN_DAYS)
def _decimal_year(ts: pd.DatetimeIndex) -> np.ndarray:
"""Convert DatetimeIndex to decimal year (for scatter colouring)."""
yr = ts.year.values.astype(float)
day_of_yr = ts.day_of_year.values.astype(float)
days_in_yr = np.where(
pd.DatetimeIndex(ts).is_leap_year, 366.0, 365.0
)
return yr + (day_of_yr - 1.0) / days_in_yr
# ---------------------------------------------------------------------------
# Download helpers
# ---------------------------------------------------------------------------
def _ensure_usgs(years: range) -> None:
"""Download any missing USGS yearly files."""
for yr in years:
dest = USGS_DIR / f"usgs-{yr}.csv"
if dest.exists() and dest.stat().st_size > 100:
continue
log.info("Downloading USGS %d", yr)
try:
download_usgs_year(yr, USGS_DIR, min_magnitude=MIN_MAG)
except Exception as exc:
log.warning("USGS %d download failed: %s", yr, exc)
def _oos_stations() -> list[str]:
"""Return OOS-good station list from data_availability.json."""
if AVAIL_FILE.exists():
with open(AVAIL_FILE) as fh:
data = json.load(fh)
return data.get("good_stations_oos", [])
# Fall back to all stations present in config
import yaml
cfg = yaml.safe_load((ROOT / "config" / "stations.yaml").read_text())
return list(cfg["stations"].keys())
def _ensure_nmdb_oos(stations: list[str], years: range) -> None:
"""Download any missing NMDB station-year files for OOS window."""
total = len(stations) * len(years)
done = 0
for stn in stations:
for yr in years:
dest = NMDB_DIR / f"{stn}{yr}.csv"
if dest.exists() and dest.stat().st_size > 0:
done += 1
continue
log.info("[%d/%d] Downloading NMDB %s %d", done + 1, total, stn, yr)
try:
download_station_year(stn, yr, NMDB_DIR, sleep_s=0.5)
except Exception as exc:
log.warning("NMDB %s %d download failed: %s", stn, yr, exc)
done += 1
# ---------------------------------------------------------------------------
# Load NMDB: per-bin station distribution
# ---------------------------------------------------------------------------
def _load_nmdb_bins(
start: str,
end: str,
coverage_thr: float = COVERAGE_THRESHOLD,
) -> pd.DataFrame:
"""
Load all NMDB stations for [start, end] and return a DataFrame of
per-5d-bin statistics across stations:
cr_p05, cr_p25, cr_p50, cr_p75, cr_p95, cr_min, cr_max, cr_n
Each station is normalised by its long-run mean before aggregation.
"""
t0 = pd.Timestamp(start)
t1 = pd.Timestamp(end)
start_yr = t0.year
end_yr = t1.year
# Determine which stations have files in this window
station_files: dict[str, list[Path]] = {}
for p in sorted(NMDB_DIR.glob("*.csv")):
stem = p.stem # e.g. AATA2018
stn = "".join(c for c in stem if not c.isdigit())
yr_str = "".join(c for c in stem if c.isdigit())
if not yr_str:
continue
yr = int(yr_str)
if yr < start_yr or yr > end_yr:
continue
station_files.setdefault(stn, []).append(p)
if not station_files:
log.warning("No NMDB files found for %s–%s", start, end)
return pd.DataFrame()
# Build bin grid
b0 = _bin_index(t0)
b1 = _bin_index(t1)
bin_idx = np.arange(b0, b1 + 1)
bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx])
station_means: dict[str, pd.Series] = {} # station -> per-bin mean (normalised)
for stn, _ in station_files.items():
try:
hourly = load_station(stn, start_yr, end_yr, NMDB_DIR)
except Exception as exc:
log.warning("load_station %s failed: %s", stn, exc)
continue
if hourly.empty or stn not in hourly.columns:
continue
daily = resample_daily(hourly, stn, coverage_threshold=coverage_thr)
daily_vals = daily[stn].loc[start:end]
if daily_vals.isna().all():
continue
# Normalise by station long-run mean (ignore NaN)
station_mean = daily_vals.mean(skipna=True)
if np.isnan(station_mean) or station_mean == 0:
continue
daily_norm = daily_vals / station_mean
# Resample to 5-day bins (mean within bin)
# Use bin index as grouper
day_index = daily_norm.index
bin_of_day = ((day_index - EPOCH).days // BIN_DAYS)
grp = daily_norm.groupby(bin_of_day)
bin_mean = grp.mean() # index = bin integer
bin_mean.index = [_bin_start(b) for b in bin_mean.index]
bin_mean = bin_mean.reindex(bin_starts)
station_means[stn] = bin_mean
if not station_means:
log.warning("No valid stations for %s–%s", start, end)
return pd.DataFrame()
# Stack into (bins × stations) matrix
mat = pd.DataFrame(station_means).reindex(bin_starts)
# Per-bin statistics across stations (ignore NaN)
n_valid = mat.notna().sum(axis=1)
mask = n_valid >= MIN_STATIONS # require at least MIN_STATIONS stations
result = pd.DataFrame(index=bin_starts)
result["cr_p05"] = mat.quantile(0.05, axis=1)
result["cr_p25"] = mat.quantile(0.25, axis=1)
result["cr_p50"] = mat.quantile(0.50, axis=1)
result["cr_p75"] = mat.quantile(0.75, axis=1)
result["cr_p95"] = mat.quantile(0.95, axis=1)
result["cr_min"] = mat.min(axis=1)
result["cr_max"] = mat.max(axis=1)
result["cr_n"] = n_valid.values
# Mask bins with fewer than MIN_STATIONS stations
for col in result.columns:
if col != "cr_n":
result.loc[~mask, col] = np.nan
log.info("NMDB %s–%s: %d bins, %d stations, %.1f%% valid bins",
start, end, len(result), len(station_means),
100.0 * mask.sum() / len(result))
return result
# ---------------------------------------------------------------------------
# Load USGS: per-bin seismic energy E ∝ 10^(1.5·Mw)
# ---------------------------------------------------------------------------
def _load_seismic_energy(start: str, end: str) -> pd.Series:
"""
Load USGS events for [start, end] and compute per-5d-bin summed seismic
energy E = Σ 10^(1.5 · Mw). Returns log10(E); bins with no events NaN.
"""
t0 = pd.Timestamp(start)
t1 = pd.Timestamp(end)
events = load_usgs(t0.year, t1.year, USGS_DIR)
if events.empty or "mag" not in events.columns:
log.warning("No USGS events loaded for %s–%s", start, end)
return pd.Series(dtype=float)
events = events.loc[start:end]
events = events[events["mag"] >= MIN_MAG].copy()
events["energy"] = np.power(10.0, 1.5 * events["mag"].values)
# 5-day binning
b0 = _bin_index(t0)
b1 = _bin_index(t1)
bin_idx = np.arange(b0, b1 + 1)
bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx])
day_of_event = events.index.normalize()
bin_of_event = ((day_of_event - EPOCH).days // BIN_DAYS)
events["bin"] = bin_of_event.values
bin_energy = events.groupby("bin")["energy"].sum()
result = bin_energy.reindex(bin_idx)
result.index = bin_starts
log.info("Seismic %s–%s: %d events, %.1f%% bins non-zero",
start, end, len(events),
100.0 * result.notna().sum() / len(result))
return np.log10(result) # log10(E) for all operations
# ---------------------------------------------------------------------------
# Load SIDC sunspots: bin to 5-day, carry raw spread
# ---------------------------------------------------------------------------
def _load_sunspot_bins(start: str, end: str) -> pd.DataFrame:
"""
Load KSO/SIDC daily sunspot CSV and return a per-5d-bin DataFrame with:
sn_raw_mean, sn_raw_min, sn_raw_max, sn_smooth (365-day rolling mean)
The smoothed series is computed on the daily series before binning.
"""
path = SIDC_DIR / "sunspots.csv"
if not path.exists():
raise FileNotFoundError(f"Sunspot file not found: {path}")
# KSO format: Date,Total,North,South,Diff (comma-separated)
# Standard SIDC format: Year;Month;Day;FracYear;SN;StdDev;Nobs;Definitive
# Detect format by reading header
header = path.read_text(encoding="utf-8", errors="replace").splitlines()[0]
if ";" in header and "Year" in header:
# SIDC SILSO format
df = pd.read_csv(
path, sep=";",
names=["Year", "Month", "Day", "FracYear", "SN", "StdDev", "Nobs", "Def"],
header=0,
)
df["date"] = pd.to_datetime(dict(year=df["Year"], month=df["Month"], day=df["Day"]))
df = df.set_index("date")[["SN"]].rename(columns={"SN": "sn"})
df["sn"] = pd.to_numeric(df["sn"], errors="coerce")
else:
# KSO comma-separated format
df = pd.read_csv(
path, sep=",",
names=["date", "total", "north", "south", "diff"],
header=0,
)
df["date"] = pd.to_datetime(df["date"].str.strip(), errors="coerce")
df = df.dropna(subset=["date"])
df = df.set_index("date")
for col in df.columns:
df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.rename(columns={"total": "sn"})[["sn"]]
df = df.sort_index()
df = df.loc[~df.index.duplicated(keep="first")]
# 365-day rolling mean (smoothed solar cycle)
df["sn_smooth"] = df["sn"].rolling(window=365, center=True, min_periods=180).mean()
# Ensure DatetimeIndex
df.index = pd.to_datetime(df.index, errors="coerce")
df = df[df.index.notna()]
df = df.sort_index()
# Clip to window
df = df.loc[start:end]
t0 = pd.Timestamp(start)
t1 = pd.Timestamp(end)
b0 = _bin_index(t0)
b1 = _bin_index(t1)
bin_idx = np.arange(b0, b1 + 1)
bin_starts = pd.DatetimeIndex([_bin_start(b) for b in bin_idx])
day_idx = pd.DatetimeIndex(df.index).normalize()
bin_of_day = ((day_idx - EPOCH).days // BIN_DAYS)
df = df.copy()
df["bin"] = bin_of_day.values
grp = df.groupby("bin")
sn_mean = grp["sn"].mean().reindex(bin_idx)
sn_min = grp["sn"].min().reindex(bin_idx)
sn_max = grp["sn"].max().reindex(bin_idx)
sn_smooth = grp["sn_smooth"].mean().reindex(bin_idx)
result = pd.DataFrame({
"sn_mean": sn_mean.values,
"sn_min": sn_min.values,
"sn_max": sn_max.values,
"sn_smooth": sn_smooth.values,
}, index=bin_starts)
    log.info("Sunspot %s – %s: %d bins, %d daily records",
start, end, len(result), len(df))
return result
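The 5-day binning used by the loaders above can be sketched in isolation. This is a minimal, self-contained illustration of the `groupby`-on-bin-index pattern; `EPOCH` and `BIN_DAYS` here are illustrative stand-ins for the module-level constants assumed earlier in the script.

```python
import numpy as np
import pandas as pd

# Map each day to an integer bin relative to an epoch, then aggregate per bin.
# EPOCH / BIN_DAYS mirror the module-level constants referenced above; the
# values here are illustrative only.
EPOCH = pd.Timestamp("1976-01-01")
BIN_DAYS = 5

days = pd.date_range("1976-01-01", periods=12, freq="D")
df = pd.DataFrame({"sn": np.arange(12.0)}, index=days)

# Integer bin index per day: days 0-4 -> bin 0, days 5-9 -> bin 1, ...
bin_of_day = (df.index.normalize() - EPOCH).days // BIN_DAYS
per_bin = df.groupby(bin_of_day)["sn"].mean()
```

Grouping by a positionally aligned array (rather than a column) keeps the daily index intact while producing one aggregate row per 5-day bin, which is exactly what the mean/min/max/percentile aggregations above rely on.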
# ---------------------------------------------------------------------------
# Correlation statistics
# ---------------------------------------------------------------------------
def _pearson_with_ci(x: np.ndarray, y: np.ndarray, alpha: float = 0.05):
"""
Pearson r, p-value, and (1-alpha) CI via Fisher z-transform.
Returns (r, p, ci_lo, ci_hi, n).
"""
mask = np.isfinite(x) & np.isfinite(y)
x, y = x[mask], y[mask]
n = len(x)
if n < 4:
return np.nan, np.nan, np.nan, np.nan, n
r, p = stats.pearsonr(x, y)
z = np.arctanh(r)
se = 1.0 / np.sqrt(n - 3)
z_crit = stats.norm.ppf(1.0 - alpha / 2)
ci_lo = float(np.tanh(z - z_crit * se))
ci_hi = float(np.tanh(z + z_crit * se))
return float(r), float(p), ci_lo, ci_hi, n
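The Fisher z-transform CI can be sanity-checked with a standalone sketch that mirrors the logic of `_pearson_with_ci` (the function name and constant below are illustrative; the z-critical value is hard-coded to keep the example numpy-only).

```python
import numpy as np

Z95 = 1.959963984540054  # two-sided 5% normal quantile, i.e. norm.ppf(0.975)

def fisher_ci(x: np.ndarray, y: np.ndarray):
    """Pearson r with a 95% CI via the Fisher z-transform (illustrative helper)."""
    r = float(np.corrcoef(x, y)[0, 1])
    n = len(x)
    z = np.arctanh(r)            # z-transform makes the sampling variance ~constant
    se = 1.0 / np.sqrt(n - 3)    # approximate standard error of z
    lo = float(np.tanh(z - Z95 * se))
    hi = float(np.tanh(z + Z95 * se))
    return r, lo, hi

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(size=200)   # true correlation ~ 0.57
r, lo, hi = fisher_ci(x, y)
```

By construction the interval brackets the point estimate, and at n = 200 the interval is fairly tight, which is the behaviour the table in the paper depends on.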
def _spearman(x: np.ndarray, y: np.ndarray):
"""Spearman ρ and p-value. Returns (rho, p, n)."""
mask = np.isfinite(x) & np.isfinite(y)
x, y = x[mask], y[mask]
n = len(x)
if n < 4:
return np.nan, np.nan, n
rho, p = stats.spearmanr(x, y)
return float(rho), float(p), n
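Reporting both coefficients is not redundant: on a monotone but nonlinear relationship the rank-based Spearman ρ saturates at 1 while the linear Pearson r is attenuated. A quick sketch:

```python
import numpy as np
from scipy import stats

# Monotone but strongly nonlinear: ranks are perfectly ordered, values are not
# linearly related, so rho = 1 while r < 1.
x = np.arange(1.0, 31.0)
y = x ** 3
r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
```

A large gap between r and ρ in the results table would therefore hint at a nonlinear (but still monotone) coupling rather than no coupling at all.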
def _correlate_pair(
x: np.ndarray, y: np.ndarray, label: str, window: str, n_tests: int
) -> dict:
"""Compute Pearson + Spearman for one (x, y) pair with Bonferroni correction."""
pr, pp, ci_lo, ci_hi, n = _pearson_with_ci(x, y)
rho, sp, _ = _spearman(x, y)
return {
"label": label,
"window": window,
"n_bins": n,
"pearson_r": pr,
"pearson_p": pp,
"pearson_ci_lo": ci_lo,
"pearson_ci_hi": ci_hi,
"pearson_p_bonf": min(1.0, pp * n_tests) if np.isfinite(pp) else np.nan,
"spearman_rho": rho,
"spearman_p": sp,
"spearman_p_bonf": min(1.0, sp * n_tests) if np.isfinite(sp) else np.nan,
}
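The Bonferroni adjustment used in `_correlate_pair` is simply the raw p-value scaled by the number of tests in the family and capped at 1. A standalone sketch with the script's 9-test family:

```python
import math

def bonferroni(p: float, n_tests: int) -> float:
    """Bonferroni-corrected p-value, capped at 1.0 (mirrors _correlate_pair)."""
    if not math.isfinite(p):
        return float("nan")
    return min(1.0, p * n_tests)

# With 9 tests, a raw p = 0.004 survives at the 0.05 level (0.036),
# while a raw p = 0.02 does not (0.18).
p_survives = bonferroni(0.004, 9)
p_fails = bonferroni(0.02, 9)
p_capped = bonferroni(0.5, 9)
```

The cap matters: without it, weak correlations with raw p around 0.5 would report nonsensical corrected values above 1.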
# ---------------------------------------------------------------------------
# Plotting
# ---------------------------------------------------------------------------
def _lowess_line(x: np.ndarray, y: np.ndarray, frac: float = 0.4):
"""Return (x_sorted, y_smooth) for a LOWESS trend line."""
mask = np.isfinite(x) & np.isfinite(y)
if mask.sum() < 10:
return np.array([]), np.array([])
xl, yl = x[mask], y[mask]
order = np.argsort(xl)
xl, yl = xl[order], yl[order]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
sm = lowess(yl, xl, frac=frac, return_sorted=True)
return sm[:, 0], sm[:, 1]
def _scatter_panel(
ax: plt.Axes,
x_med: np.ndarray,
x_lo: np.ndarray,
x_hi: np.ndarray,
x_xlo: np.ndarray, # extreme low (min)
x_xhi: np.ndarray, # extreme high (max)
y: np.ndarray,
times: np.ndarray, # decimal year for colour
xlabel: str,
ylabel: str,
title: str,
corr_text: str,
cmap: str = "plasma",
show_x_bands: bool = True,
) -> None:
"""
Scatter plot with:
- Points coloured by time (decimal year)
    - p5–p95 band as horizontal error bars (light)
    - min–max band as even lighter error bars
- LOWESS trend line
- Correlation annotation
"""
mask = np.isfinite(x_med) & np.isfinite(y)
xm, ym, tm = x_med[mask], y[mask], times[mask]
if len(tm) == 0:
ax.set_title(title + "\n(no data)", fontsize=9)
ax.text(0.5, 0.5, "No data", transform=ax.transAxes,
ha="center", va="center", fontsize=10, color="gray")
return
norm = mcolors.Normalize(vmin=tm.min(), vmax=tm.max())
sc = ax.scatter(xm, ym, c=tm, cmap=cmap, norm=norm, s=8, alpha=0.65, zorder=3)
# Uncertainty bands on x (station percentile spread)
if show_x_bands and x_lo is not None and x_hi is not None:
xlo_m = x_lo[mask]
xhi_m = x_hi[mask]
xerr_lo = np.clip(xm - xlo_m, 0, None)
xerr_hi = np.clip(xhi_m - xm, 0, None)
ax.errorbar(
xm, ym,
xerr=[xerr_lo, xerr_hi],
fmt="none", ecolor="steelblue", alpha=0.12, linewidth=0.5, zorder=2,
)
if show_x_bands and x_xlo is not None and x_xhi is not None:
xxlo_m = x_xlo[mask]
xxhi_m = x_xhi[mask]
xerr_lo2 = np.clip(xm - xxlo_m, 0, None)
xerr_hi2 = np.clip(xxhi_m - xm, 0, None)
ax.errorbar(
xm, ym,
xerr=[xerr_lo2, xerr_hi2],
fmt="none", ecolor="steelblue", alpha=0.06, linewidth=0.4, zorder=1,
)
# LOWESS trend
xl, yl = _lowess_line(xm, ym)
if len(xl):
ax.plot(xl, yl, "k-", linewidth=1.5, zorder=4, label="LOWESS")
ax.set_xlabel(xlabel, fontsize=8)
ax.set_ylabel(ylabel, fontsize=8)
ax.set_title(title, fontsize=9, fontweight="bold")
ax.tick_params(labelsize=7)
ax.text(
0.03, 0.97, corr_text,
transform=ax.transAxes, fontsize=7,
va="top", ha="left",
bbox=dict(boxstyle="round,pad=0.3", fc="white", alpha=0.8),
)
# Colourbar
cbar = plt.colorbar(sc, ax=ax, shrink=0.55, pad=0.02)
cbar.set_label("Year", fontsize=7)
cbar.ax.tick_params(labelsize=6)
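The colouring scheme of `_scatter_panel` can be reduced to a few lines: a shared `Normalize` maps decimal year onto the colormap, and asymmetric x error bars are drawn with `fmt="none"` so only the bars appear. A minimal headless sketch (synthetic data, illustrative values):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(scale=0.3, size=50)
years = np.linspace(1976.0, 2019.9, 50)

fig, ax = plt.subplots(figsize=(4, 3))
# Points coloured by time, exactly as in the panels above.
norm = mcolors.Normalize(vmin=years.min(), vmax=years.max())
sc = ax.scatter(x, y, c=years, cmap="plasma", norm=norm, s=8, alpha=0.65)
# Asymmetric, nearly transparent x error bars standing in for the p5-p95 band.
ax.errorbar(x, y, xerr=[np.full(50, 0.1), np.full(50, 0.2)],
            fmt="none", ecolor="steelblue", alpha=0.12, linewidth=0.5)
cbar = fig.colorbar(sc, ax=ax, shrink=0.55, pad=0.02)
cbar.set_label("Year", fontsize=7)
plt.close(fig)
```

Passing a 2-element list to `xerr` is what allows the lower and upper station-spread half-widths to differ, which a single symmetric `xerr` array could not express.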
def _make_figure(
window_label: str,
cr_bins: pd.DataFrame,
seis_log_e: pd.Series,
sun_bins: pd.DataFrame,
stats_list: list[dict],
) -> plt.Figure:
"""
Three-panel figure for one time window:
Panel 1: CR (p50) vs log10(Seismic energy)
Panel 2: CR (p50) vs Sunspot (smoothed)
Panel 3: Sunspot (smoothed) vs log10(Seismic energy)
"""
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5))
fig.suptitle(
f"Raw pairwise correlations — {window_label} (no detrending)",
fontsize=11, fontweight="bold", y=1.01,
)
# Align all series to same index
idx = cr_bins.index
cr_p50 = cr_bins["cr_p50"].reindex(idx).values
cr_p05 = cr_bins["cr_p05"].reindex(idx).values
cr_p95 = cr_bins["cr_p95"].reindex(idx).values
cr_min = cr_bins["cr_min"].reindex(idx).values
cr_max = cr_bins["cr_max"].reindex(idx).values
seis = seis_log_e.reindex(idx).values
sn_sm = sun_bins["sn_smooth"].reindex(idx).values
sn_raw = sun_bins["sn_mean"].reindex(idx).values
sn_min = sun_bins["sn_min"].reindex(idx).values
sn_max = sun_bins["sn_max"].reindex(idx).values
times = _decimal_year(idx)
def _fmt_stat(d: dict | None, key_r: str, key_rho: str) -> str:
if d is None:
return ""
r = d.get(key_r, np.nan)
rho = d.get(key_rho, np.nan)
pp = d.get("pearson_p", np.nan)
sp = d.get("spearman_p", np.nan)
def _pstr(p):
if not np.isfinite(p):
return ""
if p < 0.001:
return "p<0.001"
return f"p={p:.3f}"
rs = f"r={r:.3f}" if np.isfinite(r) else "r=—"
rhos = f"ρ={rho:.3f}" if np.isfinite(rho) else "ρ=—"
return f"{rs} {_pstr(pp)}\n{rhos} {_pstr(sp)}"
# Find stat records for this window
def _find(label: str) -> dict | None:
for d in stats_list:
if d["window"] == window_label and d["label"] == label:
return d
return None
# Panel 1: CR vs Seismic
ax = axes[0]
_scatter_panel(
ax, cr_p50, cr_p05, cr_p95, cr_min, cr_max, seis, times,
xlabel="CR index (station median, norm.)",
ylabel="log₁₀(Seismic energy)",
title="CR vs Seismicity",
corr_text=_fmt_stat(_find("CR_p50 vs Seismic"), "pearson_r", "spearman_rho"),
show_x_bands=True,
)
# Panel 2: CR vs Sunspot
ax = axes[1]
_scatter_panel(
ax, cr_p50, cr_p05, cr_p95, cr_min, cr_max, sn_sm, times,
xlabel="CR index (station median, norm.)",
ylabel="Sunspot number (smoothed)",
title="CR vs Sunspot Number",
corr_text=_fmt_stat(_find("CR_p50 vs Sunspot"), "pearson_r", "spearman_rho"),
show_x_bands=True,
)
# Panel 3: Sunspot vs Seismic (sunspot on x with daily spread as error)
ax = axes[2]
_scatter_panel(
ax, sn_sm, sn_min, sn_max, None, None, seis, times,
xlabel="Sunspot number (365d smoothed)",
ylabel="log₁₀(Seismic energy)",
title="Sunspot Number vs Seismicity",
corr_text=_fmt_stat(_find("Sunspot vs Seismic"), "pearson_r", "spearman_rho"),
show_x_bands=True,
)
# Add raw sunspot spread as lighter error bars
mask3 = np.isfinite(sn_sm) & np.isfinite(seis)
ax.errorbar(
sn_sm[mask3], seis[mask3],
xerr=[
np.clip(sn_sm[mask3] - sn_min[mask3], 0, None),
np.clip(sn_max[mask3] - sn_sm[mask3], 0, None),
],
fmt="none", ecolor="orange", alpha=0.08, linewidth=0.4, zorder=1,
)
fig.tight_layout()
return fig
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main() -> None:
FIG_DIR.mkdir(parents=True, exist_ok=True)
OUT_DIR.mkdir(parents=True, exist_ok=True)
# ── 1. Download missing OOS data ────────────────────────────────────────
log.info("Checking OOS USGS data …")
_ensure_usgs(range(2020, 2026))
log.info("Checking OOS NMDB data …")
oos_stations = _oos_stations()
log.info("OOS stations to download: %d", len(oos_stations))
_ensure_nmdb_oos(oos_stations, range(2020, 2026))
# ── 2. Define windows ───────────────────────────────────────────────────
windows = {
"In-sample (19762019)": (IN_SAMPLE_START, IN_SAMPLE_END),
"OOS (20202025)": (OOS_START, OOS_END),
"Combined (19762025)": (COMBINED_START, COMBINED_END),
}
# ── 3. Compute correlations ─────────────────────────────────────────────
    n_tests = 9  # Bonferroni denominator: 3 primary pairs × 3 windows (CR_p95 variants are supplementary)
all_stats: list[dict] = []
window_data: dict[str, tuple] = {}
for win_label, (wstart, wend) in windows.items():
log.info("=== Window: %s ===", win_label)
cr_bins = _load_nmdb_bins(wstart, wend)
seis_log_e = _load_seismic_energy(wstart, wend)
sun_bins = _load_sunspot_bins(wstart, wend)
if cr_bins.empty:
log.warning("No CR data for %s — skipping", win_label)
continue
window_data[win_label] = (cr_bins, seis_log_e, sun_bins)
# Align to common index
idx = cr_bins.index
cr_p50 = cr_bins["cr_p50"].reindex(idx).values
cr_p95 = cr_bins["cr_p95"].reindex(idx).values
seis = seis_log_e.reindex(idx).values
sn = sun_bins["sn_smooth"].reindex(idx).values
pairs = [
("CR_p50 vs Seismic", cr_p50, seis),
("CR_p95 vs Seismic", cr_p95, seis),
("CR_p50 vs Sunspot", cr_p50, sn),
("CR_p95 vs Sunspot", cr_p95, sn),
("Sunspot vs Seismic", sn, seis),
]
for label, x, y in pairs:
rec = _correlate_pair(x, y, label, win_label, n_tests)
all_stats.append(rec)
log.info(
" %-30s r=% .3f (p=%.3g) ρ=% .3f (p=%.3g) n=%d",
label,
rec["pearson_r"], rec["pearson_p"],
rec["spearman_rho"], rec["spearman_p"],
rec["n_bins"],
)
# ── 4. Save JSON ─────────────────────────────────────────────────────────
out_json = OUT_DIR / "raw_pairwise_correlations.json"
def _nan_to_none(obj):
        if isinstance(obj, (float, np.floating)) and np.isnan(obj):
return None
if isinstance(obj, dict):
return {k: _nan_to_none(v) for k, v in obj.items()}
if isinstance(obj, list):
return [_nan_to_none(v) for v in obj]
return obj
with open(out_json, "w") as fh:
json.dump(_nan_to_none({"n_tests_bonferroni": n_tests, "results": all_stats}), fh, indent=2)
log.info("Saved %s", out_json)
# ── 5. Print LaTeX table ─────────────────────────────────────────────────
_print_latex_table(all_stats)
# ── 6. Produce figures ───────────────────────────────────────────────────
fig_names = {
"In-sample (19762019)": "raw_corr_insample.png",
"OOS (20202025)": "raw_corr_oos.png",
"Combined (19762025)": "raw_corr_combined.png",
}
for win_label, (cr_bins, seis_log_e, sun_bins) in window_data.items():
fig = _make_figure(win_label, cr_bins, seis_log_e, sun_bins, all_stats)
fname = FIG_DIR / fig_names[win_label]
fig.savefig(fname, dpi=150, bbox_inches="tight")
plt.close(fig)
log.info("Saved %s", fname)
log.info("Done.")
# ---------------------------------------------------------------------------
# LaTeX table helper
# ---------------------------------------------------------------------------
def _print_latex_table(stats: list[dict]) -> None:
"""Print a LaTeX longtable fragment to stdout."""
# Primary 9 pairs (CR_p50 + Sunspot vs Seismic)
primary_labels = ["CR_p50 vs Seismic", "CR_p50 vs Sunspot", "Sunspot vs Seismic"]
window_order = [
"In-sample (19762019)",
"OOS (20202025)",
"Combined (19762025)",
]
def _lookup(label, window):
for d in stats:
if d["label"] == label and d["window"] == window:
return d
return {}
def _rf(v, fmt=".3f"):
if v is None or (isinstance(v, float) and np.isnan(v)):
return ""
return format(v, fmt)
def _pstar(p_bonf):
if p_bonf is None or (isinstance(p_bonf, float) and np.isnan(p_bonf)):
return ""
if p_bonf < 0.001:
return "$^{***}$"
if p_bonf < 0.01:
return "$^{**}$"
if p_bonf < 0.05:
return "$^{*}$"
return ""
# Map labels to display names
label_display = {
"CR_p50 vs Seismic": r"CR (med.) vs Seismicity",
"CR_p50 vs Sunspot": r"CR (med.) vs Sunspot",
"Sunspot vs Seismic": r"Sunspot vs Seismicity",
}
win_display = {
"In-sample (19762019)": r"In-sample (1976--2019)",
"OOS (20202025)": r"OOS (2020--2025)",
"Combined (19762025)": r"Combined (1976--2025)",
}
lines = []
lines.append(r"""% Auto-generated by 09_raw_pairwise_correlations.py
\begin{table}[htbp]
\centering
\caption{Raw pairwise correlation statistics across three time windows.
Bonferroni correction applied for $3 \times 3 = 9$ tests.
CR uses the per-bin station-median index.
Seismic energy is $\log_{10}\!\left(\sum 10^{1.5 M_W}\right)$.
Sunspot is the 365-day smoothed daily count.
$^{*}p_\text{Bonf}<0.05$, $^{**}p_\text{Bonf}<0.01$,
$^{***}p_\text{Bonf}<0.001$.}
\label{tab:rawcorr}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{llrrrrrrr}
\toprule
Pair & Window & $N$ &
$r$ & 95\% CI &
$p$ (raw) & $p$ (Bonf.) &
$\rho$ & $p_\rho$ (Bonf.) \\
\midrule""")
for lbl in primary_labels:
disp = label_display[lbl]
first = True
for win in window_order:
d = _lookup(lbl, win)
r = d.get("pearson_r")
ci_lo = d.get("pearson_ci_lo")
ci_hi = d.get("pearson_ci_hi")
pp = d.get("pearson_p")
pp_b = d.get("pearson_p_bonf")
rho = d.get("spearman_rho")
sp_b = d.get("spearman_p_bonf")
n = d.get("n_bins", 0)
star = _pstar(pp_b)
rho_star = _pstar(sp_b)
            ci_str = f"[{_rf(ci_lo)}, {_rf(ci_hi)}]" if _rf(ci_lo) else ""
row_lbl = disp if first else ""
first = False
lines.append(
f" {row_lbl} & {win_display.get(win, win)} & {n} & "
f"{_rf(r)}{star} & {ci_str} & "
f"{_rf(pp, '.3g')} & {_rf(pp_b, '.3g')} & "
f"{_rf(rho)}{rho_star} & {_rf(sp_b, '.3g')} \\\\"
)
lines.append(r" \addlinespace")
lines.append(r""" \bottomrule
\end{tabular}
\bigskip
\textit{Note: CR\textsubscript{p95} variant (station 95th percentile instead of median)
gives similar structure; see \texttt{results/raw\_pairwise\_correlations.json}.}
\end{table}""")
table_text = "\n".join(lines)
print(table_text)
# Also save to file
out = OUT_DIR / "raw_pairwise_table.tex"
out.write_text(table_text + "\n", encoding="utf-8")
log.info("LaTeX table written to %s", out)
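The star convention used in the table maps Bonferroni-corrected p-values onto the usual three significance tiers. A plain-text sketch of `_pstar`'s thresholds (LaTeX markup stripped here for clarity):

```python
import math

def p_stars(p_bonf) -> str:
    """Significance stars for a Bonferroni-corrected p-value,
    matching _pstar's thresholds without the LaTeX superscript markup."""
    if p_bonf is None or (isinstance(p_bonf, float) and math.isnan(p_bonf)):
        return ""
    if p_bonf < 0.001:
        return "***"
    if p_bonf < 0.01:
        return "**"
    if p_bonf < 0.05:
        return "*"
    return ""
```

Because the thresholds are applied to the *corrected* p-values, a single star already implies family-wise significance at the 0.05 level across all nine tests.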
if __name__ == "__main__":
main()