Back

Combining LSTM Neural Networks with Black-Litterman Portfolio Optimization: A Walk-Forward Study on 533 S&P 500 Stocks

Abstract

We present a systematic framework for integrating deep learning return forecasts with Bayesian portfolio construction. Long Short-Term Memory (LSTM) neural networks generate 5-day forward return predictions for 533 S&P 500 constituents, which are incorporated as investor "views" into the Black-Litterman (BL) model. We train 6,396 individual models across 12 quarterly walk-forward folds and introduce a hybrid blending strategy that selectively uses LSTM predictions only where model confidence exceeds a threshold, falling back to mean-variance weights elsewhere. This hybrid approach achieves an annualized return of 110.6% with a Sharpe ratio of 2.54 over a 3-year out-of-sample period (March 2023 -- February 2026), compared to 22.5% (Sharpe 1.45) for the S&P 500 benchmark. We further evaluate Monte Carlo Dropout as a per-prediction uncertainty filter, conduct extensive sensitivity analyses across risk aversion, position concentration, rebalancing frequency, and uncertainty thresholds, and assess statistical significance through the Probabilistic Sharpe Ratio, the Deflated Sharpe Ratio, sub-period consistency, and bootstrap confidence intervals.


1. Introduction

Portfolio optimization has evolved substantially since Markowitz (1952) introduced mean-variance analysis. The Black-Litterman model (Black & Litterman, 1992) addressed key practical limitations of classical mean-variance by combining a market equilibrium prior with investor views through Bayesian updating, producing more stable and intuitive portfolio allocations. Meanwhile, deep learning methods --- particularly recurrent neural networks --- have shown promise in capturing nonlinear temporal dependencies in financial time series.

This work bridges these two paradigms. We use LSTM networks to generate return forecasts for a broad equity universe, then feed those forecasts as views into the Black-Litterman framework. The central challenge is that individual stock prediction accuracy, even with well-calibrated models, hovers near 50% across a large universe. Our key insight is that selective use of predictions --- deploying LSTM views only where model confidence is high, and relying on diversified mean-variance weights elsewhere --- dramatically outperforms both pure deep learning and pure optimization approaches.

We make four contributions:

  1. A scalable LSTM-to-BL pipeline that trains 6,396 per-stock models in walk-forward fashion and integrates predictions as Black-Litterman views with confidence-weighted uncertainty.
  2. A hybrid blending strategy that selectively combines high-confidence LSTM views with mean-variance fallback weights, achieving a Sharpe ratio of 2.54 over 3 years.
  3. Monte Carlo Dropout uncertainty estimation as a dynamic, per-prediction confidence filter that captures real-time model reliability beyond static validation metrics.
  4. Comprehensive sensitivity analyses across five hyperparameter dimensions --- risk aversion, position limits, MC Dropout thresholds, rebalancing frequency, and weight dampening --- with statistical significance testing via the Deflated Sharpe Ratio.

2. Data and Universe Selection

2.1 Data Source and Period

We use daily OHLCV (Open, High, Low, Close, Volume) data for S&P 500 constituents from March 2021 through February 2026, sourced from the Alpaca Markets API. The benchmark is the SPDR S&P 500 ETF (SPY).

2.2 Universe Construction

From the full S&P 500, we select stocks meeting two criteria:

This yields a universe of 533 stocks, providing sufficient history for LSTM training while maintaining broad market coverage.

2.3 Data Quality

Raw OHLCV data can contain artifacts from data providers --- phantom bars at stale prices, particularly around corporate actions. We implement an anomaly filter: zero-volume bars where the close price deviates more than 10% from the median of nearby (plus/minus 10 trading days) valid prices are removed. This filter identifies 29 problematic rows, most notably 23 bars for Texas Pacific Land Corporation (TPL) at a stale pre-split price of ~$251 versus the true price of ~$145, which caused spurious daily swings of +73% / -42%. Failure to clean these artifacts inflated backtest returns by 7--20 percentage points. Gaps created by removal are forward-filled with the last valid price.


3. Feature Engineering

3.1 Feature Set

For each stock, we compute 10 daily features from OHLCV data:

FeatureFormulaRationale
Log returnln(Ct/Ct1)\ln(C_t / C_{t-1})Price momentum
High-low range(HtLt)/Ct(H_t - L_t) / C_tIntraday volatility
Close-open range(CtOt)/Ct(C_t - O_t) / C_tIntraday directional bias
Volume changeΔVt/Vt1\Delta V_t / V_{t-1}Volume momentum
Rolling volatility (5d)σ5(log ret)\sigma_5(\text{log ret})Short-term risk
Rolling volatility (20d)σ20(log ret)\sigma_{20}(\text{log ret})Medium-term risk
Rolling mean (5d)μ5(log ret)\mu_5(\text{log ret})Short-term trend
Rolling mean (20d)μ20(log ret)\mu_{20}(\text{log ret})Medium-term trend
RSI (14-period)Standard RSI formulaOverbought/oversold
MACD signalMACD histogramTrend following

3.2 Normalization

To prevent look-ahead bias, all features are normalized using an expanding-window z-score:

zt=xtxˉ1:tσ1:tz_t = \frac{x_t - \bar{x}_{1:t}}{\sigma_{1:t}}

where xˉ1:t\bar{x}_{1:t} and σ1:t\sigma_{1:t} are the mean and standard deviation computed using only data up to time tt. A minimum of 60 observations is required before normalization begins, ensuring stable statistics.

3.3 Prediction Target

The target variable is the 5-day forward return:

yt=Ct+5Ct1y_t = \frac{C_{t+5}}{C_t} - 1

This horizon balances signal strength (sufficient price movement for meaningful prediction) with rebalancing frequency (weekly).


4. LSTM Architecture and Training

4.1 Model Architecture

We employ a two-layer LSTM architecture to capture temporal dependencies at multiple scales while mitigating overfitting on limited per-stock training data:

InputLSTM(64)Dropout(0.3)LSTM(32)Dropout(0.3)Dense(32,ReLU)Dropout(0.2)Dense(16,ReLU)Dense(1,sigmoid)\text{Input} \rightarrow \text{LSTM}(64) \rightarrow \text{Dropout}(0.3) \rightarrow \text{LSTM}(32) \rightarrow \text{Dropout}(0.3) \rightarrow \text{Dense}(32, \text{ReLU}) \rightarrow \text{Dropout}(0.2) \rightarrow \text{Dense}(16, \text{ReLU}) \rightarrow \text{Dense}(1, \text{sigmoid})

LSTM Architecture

Training uses the Adam optimizer with binary cross-entropy loss, batch size 64, maximum 50 epochs, and early stopping with patience 8 on validation loss (80/20 train/validation split within each fold).

4.2 Walk-Forward Training

Models are trained in a rigorous walk-forward fashion to prevent look-ahead bias:

The first fold trains on approximately 2 years of data (2021-03 to 2023-02), with each subsequent fold adding ~63 days. This expanding-window approach ensures models always train on all available history while maintaining strict temporal separation between training and out-of-sample evaluation.

4.3 Directional Accuracy and Confidence

For each trained model, we measure directional accuracy on the validation set: the fraction of predictions where the sign of the predicted return matches the sign of the realized return. We map this to a confidence score:

confidence=max(0.1,  min(1.0,  2×(accuracy0.5)))\text{confidence} = \max\big(0.1, \; \min(1.0, \; 2 \times (\text{accuracy} - 0.5))\big)

A model with 50% accuracy (random baseline) receives confidence 0.1; a model with 75% accuracy receives confidence 0.5. This confidence directly controls how much weight the Black-Litterman posterior gives to each model's predictions.

4.4 Training Results

Across all 6,396 models:

StatisticValue
Mean directional accuracy49.9%
Median directional accuracy50.8%
Models with accuracy \geq 55%2,099 / 6,396 (32.8%)
Models with accuracy \geq 60%1,122 / 6,396 (17.5%)
Maximum accuracy82.5%

The overall accuracy distribution is centered at 50% --- essentially random for most stocks. This is consistent with the efficient market hypothesis for broad equity prediction and motivates our hybrid approach: rather than using all predictions indiscriminately, we selectively deploy only the most reliable ones.

LSTM quality distribution Figure 1: Distribution of LSTM directional accuracy across 6,396 models (533 stocks x 12 folds). The distribution is centered near 50% (random baseline), with a right tail of genuinely predictive models.


5. Monte Carlo Dropout for Uncertainty Estimation

5.1 Methodology

Standard neural network inference disables dropout at test time, producing a single point prediction with no uncertainty estimate. Monte Carlo (MC) Dropout (Gal & Ghahramani, 2016) provides a principled approximation to Bayesian inference: by keeping dropout active during inference and running multiple forward passes, the variance across predictions approximates the model's epistemic uncertainty.

For each prediction, we run N=10N = 10 forward passes with dropout active:

y^t(k)=fθ(Xt;dropoutk),k=1,,10\hat{y}_t^{(k)} = f_\theta(X_t; \text{dropout}_k), \quad k = 1, \ldots, 10

The final prediction and uncertainty are:

y^t=1Nk=1Ny^t(k),σMC,t=1N1k=1N(y^t(k)y^t)2\hat{y}_t = \frac{1}{N}\sum_{k=1}^N \hat{y}_t^{(k)}, \qquad \sigma_{\text{MC},t} = \sqrt{\frac{1}{N-1}\sum_{k=1}^N \left(\hat{y}_t^{(k)} - \hat{y}_t\right)^2}

5.2 Interpretation

Unlike static directional accuracy (which is fixed per model per fold), MC Dropout uncertainty is dynamic and per-prediction, capturing whether the model is confident about this specific stock on this specific date.

5.3 Strategy Variants

MC Dropout uncertainty is used in two strategy variants:


6. The Black-Litterman Framework

6.1 Market Equilibrium Prior

The Black-Litterman model begins by reverse-engineering the implied expected returns that make the current market portfolio optimal. Assuming the market is in equilibrium, the implied excess returns are:

Π=δΣwmkt\Pi = \delta \cdot \Sigma \cdot w_{\text{mkt}}

where:

This prior represents the market's consensus view: stocks with higher capitalization and higher covariance with the market should have higher expected returns.

6.2 Investor Views

LSTM predictions are expressed as absolute views on individual stock returns:

Pμ=Q+ϵ,ϵN(0,Ω)P \cdot \mu = Q + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \Omega)

where:

6.3 View Uncertainty (Omega)

The uncertainty assigned to each view determines how strongly it influences the posterior. We construct Ω\Omega as a diagonal matrix:

Ωi,i=τΣi,iconfidencei\Omega_{i,i} = \frac{\tau \cdot \Sigma_{i,i}}{\text{confidence}_i}

where:

This formulation has two desirable properties: (1) predictions for volatile stocks are automatically down-weighted, and (2) models with higher validation accuracy receive proportionally more influence in the posterior.

6.4 Bayesian Posterior

The posterior distribution combines the equilibrium prior and investor views through precision-weighted averaging:

ΣBL=[(τΣ)1+PΩ1P]1\Sigma_{\text{BL}} = \left[(\tau \Sigma)^{-1} + P^\top \Omega^{-1} P\right]^{-1}

μBL=ΣBL[(τΣ)1Π+PΩ1Q]\mu_{\text{BL}} = \Sigma_{\text{BL}} \left[(\tau \Sigma)^{-1} \Pi + P^\top \Omega^{-1} Q\right]

The posterior mean μBL\mu_{\text{BL}} represents a principled blend of market consensus and LSTM predictions, where the blend ratio is determined by the relative precision (inverse uncertainty) of each source.


7. Covariance Estimation

7.1 The Challenge

Estimating a 533×533533 \times 533 covariance matrix from approximately 252 daily observations is an inherently ill-conditioned problem: the number of free parameters (142,011) far exceeds the number of observations. The sample covariance matrix will be nearly singular, leading to unstable portfolio weights.

7.2 Multi-Layer Regularization

We employ five layers of regularization:

  1. Ledoit-Wolf shrinkage (Ledoit & Wolf, 2004): the sample covariance is shrunk toward a structured target (scaled identity) with automatically selected optimal shrinkage intensity λ[0,1]\lambda \in [0, 1]:

Σ^=(1λ)Σsample+λμtargetI\hat{\Sigma} = (1 - \lambda) \Sigma_{\text{sample}} + \lambda \cdot \mu_{\text{target}} \cdot I

For our problem size, typical shrinkage intensity is 0.8--0.95.

  1. Rolling window: a trailing 252-day (1-year) window, balancing recency with sample size.

  2. Ridge regularization: a small diagonal term 108I10^{-8} \cdot I is added before matrix inversion to prevent numerical overflow.

  3. Symmetry enforcement: the matrix is explicitly symmetrized (Σ+Σ)/2(\Sigma + \Sigma^\top) / 2 to correct floating-point asymmetries.

  4. Positive semi-definiteness: the cvxpy solver wraps the covariance matrix using psd_wrap to enforce the PSD constraint during quadratic optimization.


8. Portfolio Optimization

8.1 Objective Function

Given posterior returns μBL\mu_{\text{BL}} and posterior covariance ΣBL\Sigma_{\text{BL}}, we solve a quadratic utility maximization:

maxwwμBLδ2wΣBLw\max_w \quad w^\top \mu_{\text{BL}} - \frac{\delta}{2} w^\top \Sigma_{\text{BL}} \, w

subject to:

wi0,iwi=1,wiwmaxw_i \geq 0, \quad \sum_i w_i = 1, \quad w_i \leq w_{\max}

where wmax=0.10w_{\max} = 0.10 (10% maximum per stock). The constraints enforce a long-only, fully-invested portfolio with bounded position sizes. The optimization is solved using the CLARABEL interior-point conic solver via cvxpy.

8.2 Transaction Costs

At each rebalance, we compute one-way turnover and apply a proportional cost:

Turnover=12iwinewwiold\text{Turnover} = \frac{1}{2} \sum_i |w_i^{\text{new}} - w_i^{\text{old}}|

Cost=Turnover×10 bps\text{Cost} = \text{Turnover} \times 10 \text{ bps}

The 10 basis point assumption covers brokerage commissions and estimated market impact for the stock sizes and trade volumes involved.


9. Strategy Variants

We evaluate 11 strategies encompassing four families:

9.1 LSTM-Based Strategies

BL-LSTM (Pure): All LSTM predictions are used as views regardless of model quality. With most models at ~50% accuracy, the posterior is dominated by noise.

BL-LSTM Hybrid: The key innovation. Only uses LSTM views for stocks where the model's validation directional accuracy exceeds 30%. For remaining stocks, mean-variance weights are used instead. The final portfolio blends BL-optimized weights (for high-confidence stocks) with MV-optimized weights (for low-confidence stocks), then re-normalizes to satisfy constraints. This prevents unreliable predictions from diluting strong signals.

BL-LSTM Hybrid+DO: Adds MC Dropout filtering on top of the accuracy gate. A stock's view is used only if accuracy \geq 0.3 and σMC<0.04\sigma_{\text{MC}} < 0.04. Stocks failing either filter receive MV fallback weights.

BL-LSTM Full DO: Ignores historical accuracy entirely; uses only MC Dropout uncertainty as the filter. Tests whether dynamic per-prediction uncertainty alone is sufficient for view selection.

9.2 Alternative View Sources

To benchmark LSTM predictions against simpler signals, we evaluate four alternative BL view generators using the same optimization framework:

BL-Momentum (Jegadeesh & Titman, 1993): 12-month return with a 1-month skip, cross-sectionally z-scored. Predicted return: zi×0.005z_i \times 0.005. Confidence from z|z| magnitude.

BL-Mean Reversion: Deviation from 50-day moving average. Predicted return: zi×0.003-z_i \times 0.003 (prices below the MA are expected to bounce).

BL-Sector Momentum: Average 3-month return per GICS sector, z-scored cross-sectionally. Captures sector rotation effects.

BL-Combined: Weighted blend of momentum (40%), mean-reversion (30%), and sector momentum (30%).

BL-Rolling Regression: 90-day trailing Ridge regression with exponential sample weighting (halflife 30 days). Features are the same 10 OHLCV indicators. Confidence from weighted R2R^2.

9.3 Baselines

Mean-Variance (MV): Rolling 252-day Markowitz optimization using sample mean returns and Ledoit-Wolf covariance. Same constraints as BL strategies but no LSTM involvement.

SPY Buy-and-Hold: Passive investment in the S&P 500 ETF.


10. Backtest Methodology

10.1 Walk-Forward Design

The backtest period spans March 2023 through February 2026 (756 trading days, approximately 3 years). LSTM models are trained on an expanding window starting from March 2021, with quarterly retraining (every 63 trading days). At each fold boundary, fresh models replace the previous fold's models.

10.2 Daily Simulation Loop

For each trading day:

  1. Compute daily returns for all 533 stocks
  2. Update portfolio value: Vt=Vt1×(1+rtwt1)V_t = V_{t-1} \times (1 + r_t^\top w_{t-1})
  3. Record portfolio value

At each weekly rebalance date (every 5 trading days): 4. Load LSTM models for the current quarterly fold 5. Generate predictions and MC Dropout uncertainty estimates 6. Estimate covariance from trailing 252 days 7. Compute equilibrium prior Π\Pi 8. Build BL posterior with confidence-filtered views 9. Optimize portfolio weights 10. Apply transaction costs to turnover 11. Record new weights

10.3 Independent Validation

We validate results using NautilusTrader, an independent event-driven backtesting framework with 5 basis point per-trade commission modeling. This confirms that our results are not artifacts of the backtesting engine. We also implement a position-drift backtest that tracks fractional shares between rebalances rather than assuming daily rebalancing (the standard fixed-weight approach overstates returns by approximately 3--4% over the 3-year period).


11. Results

11.1 Strategy Comparison

StrategyAnnual ReturnSharpeSortinoMax DrawdownCalmar
BL-LSTM Hybrid110.6%2.543.9425.3%4.36
BL-LSTM Hybrid+DO101.5%2.413.6726.0%3.90
BL-LSTM Full DO93.3%2.243.2329.4%3.17
MV Baseline87.4%1.912.8035.1%2.49
BL-Momentum71.7%1.552.1735.7%2.01
BL-Combined60.6%1.441.9834.7%1.75
BL-Reg (hl=30d)30.9%1.081.5936.1%0.86
SPY22.5%1.451.9218.7%1.20
BL-LSTM (Pure)14.6%1.251.7712.4%1.17
BL-Mean Reversion12.8%0.530.7730.1%0.42
BL-Sector9.5%0.450.6249.9%0.19

Table 1: Out-of-sample performance metrics for all 11 strategies (March 2023 -- February 2026). Transaction costs: 10 bps. Weekly rebalancing.

The results reveal a striking hierarchy. The BL-LSTM Hybrid strategy dominates all alternatives with a Sharpe ratio of 2.54, nearly double SPY's 1.45. The key insight is dramatic: pure BL-LSTM achieves only 14.6% annual return (worse than SPY) because indiscriminate use of ~50%-accuracy predictions dilutes the signal. The hybrid approach, which deploys LSTM views selectively and falls back to MV weights elsewhere, achieves 110.6% --- a 7.6x improvement over the pure approach.

Cumulative returns Figure 2: Cumulative portfolio values for all 11 strategies (log scale). The BL-LSTM Hybrid strategy (red) dominates throughout the backtest period.

Metrics comparison Figure 3: Risk-adjusted metrics comparison across strategies.

11.2 Portfolio Characteristics

The top-performing strategies hold concentrated portfolios:

StrategyMedian PositionsMean PositionsBeta (vs SPY)VaR 95%
BL-LSTM Hybrid1313.11.602.82%
BL-LSTM Hybrid+DO1413.41.582.83%
BL-LSTM Full DO1313.31.682.96%
MV Baseline1111.41.793.40%
SPY5005001.001.40%

The Hybrid strategy's outperformance comes with higher beta (1.60) and daily VaR (2.82%) relative to SPY, reflecting its concentrated positioning. However, its Sharpe ratio of 2.54 substantially exceeds what could be achieved by simply leveraging SPY (which would maintain Sharpe at 1.45 regardless of leverage).

Portfolio composition Figure 4: Top 10 holdings over time for the BL-LSTM Hybrid strategy, showing how the concentrated portfolio evolves through quarterly model retraining.

11.3 Drawdown Analysis

StrategyMax DrawdownRecovery (to new high)
BL-LSTM Hybrid25.3%~3 months
MV Baseline35.1%~5 months
SPY18.7%~4 months

Despite higher returns, the Hybrid strategy's maximum drawdown (25.3%) is moderate and recovers quickly. The Calmar ratio of 4.36 (annualized return / max drawdown) indicates excellent recovery characteristics.

Drawdowns Figure 5: Drawdown time series for key strategies.

Rolling Sharpe Figure 6: 63-day rolling Sharpe ratio. The Hybrid strategy maintains consistently positive rolling Sharpe throughout the backtest.


12. Sensitivity Analyses

12.1 MC Dropout Threshold Sweep

We vary the MC Dropout threshold θ{0.02,0.03,0.04,0.05,0.08,0.10,1.0}\theta \in \{0.02, 0.03, 0.04, 0.05, 0.08, 0.10, 1.0\} for both the Hybrid+DO and Full DO strategies:

Strategyθ\thetaAnnual ReturnSharpeSortinoMax DDCalmar
Hybrid+DO0.02102.2%2.313.5526.8%3.81
Hybrid+DO0.04111.0%2.443.7626.5%4.18
Hybrid+DO0.10112.7%2.493.9024.1%4.68
Hybrid+DO1.0 (off)118.1%2.564.0425.4%4.66
Full DO0.02101.3%2.323.4624.8%4.09
Full DO0.04114.2%2.573.8228.2%4.04
Full DO0.0597.9%2.353.5327.6%3.54
Full DO1.0 (off)78.4%2.013.0126.0%3.02

Table 2: MC Dropout threshold sensitivity. θ=1.0\theta = 1.0 effectively disables the filter.

The two strategies respond differently to the threshold:

This demonstrates that per-prediction MC Dropout uncertainty captures genuine prediction quality --- at the optimal threshold, Full DO (which uses no historical accuracy information) achieves the highest Sharpe ratio of any single configuration.

12.2 Risk Aversion and Position Concentration

We sweep two Black-Litterman hyperparameters: the risk aversion coefficient δ{1.0,2.5,5.0}\delta \in \{1.0, 2.5, 5.0\} and maximum position size wmax{2%,5%,10%}w_{\max} \in \{2\%, 5\%, 10\%\} across all four core strategies (36 configurations total).

StrategyBest δ\deltaBest wmaxw_{\max}SharpeReturnMax DDPositions
BL-LSTM Hybrid5.010%2.62103.7%23.1%14
BL-LSTM Hybrid+DO5.010%2.4795.4%24.0%14
BL-LSTM Full DO5.010%2.3194.5%28.1%13
MV1.010%1.98100.1%37.7%10

Table 3: Best (delta, max_weight) per strategy by Sharpe ratio.

Key findings:

Sharpe heatmaps Figure 7: Sharpe ratio heatmaps for each strategy across (delta, max_weight) combinations. Warmer colors indicate higher Sharpe ratios.

Position count heatmaps Figure 8: Median position counts by strategy and parameter combination. Lower max_weight forces more diversified portfolios.

Equity curves by delta Figure 9: Equity curves at max_weight=10% across three delta values. Higher delta moderates volatility but reduces peak returns for LSTM strategies.

Equity curves by max weight Figure 10: Equity curves at delta=2.5 across three position limits. Higher concentration drives stronger returns.

12.3 Rebalancing Frequency and Execution

We evaluate three dimensions of rebalancing policy: frequency, turnover threshold, and weight dampening.

Frequency (weekly / bi-weekly / monthly):

Strategy5-day10-day20-day
BL-LSTM Hybrid118.1% (2.56)100.1% (2.31)92.9% (2.17)
BL-LSTM Full DO114.2% (2.57)90.4% (2.18)63.7% (1.68)
MV Baseline93.2% (1.93)84.1% (1.82)89.8% (1.88)

Returns and (Sharpe ratios). Weekly rebalancing is optimal for LSTM strategies.

Weekly rebalancing (5d) is critical for LSTM strategies: the 5-day prediction horizon aligns naturally with weekly rebalancing, and longer intervals allow positions to drift away from optimal weights. The MV baseline is less sensitive to frequency since it relies on slower-moving covariance estimates.

Turnover threshold (minimum turnover to trigger rebalance): Negligible impact --- all strategies show less than 1% performance difference across thresholds of 0%, 5%, 10%, and 20%.

Dampening (blend fraction between old and new weights): Full rebalancing (α=100%\alpha = 100\%) outperforms partial adjustment. At α=50%\alpha = 50\%, the Hybrid strategy drops from 118.1% to 102.0% as stale weights accumulate.

Rebalancing frequency Figure 11: Impact of rebalancing frequency on Sharpe ratio across strategies.

Rebalancing heatmap Figure 12: Full rebalancing sweep heatmap showing the interaction of frequency, turnover threshold, and dampening.

12.4 NVDA Sensitivity

Given NVDA's outsized returns during this period, we re-run the full backtest excluding it (532 stocks):

StrategyWith NVDAWithout NVDADifference
BL-LSTM Hybrid110.6%110.1%-0.5pp
BL-LSTM Hybrid+DO101.5%103.5%+2.0pp
BL-LSTM Full DO93.3%90.5%-2.8pp
MV Baseline87.4%85.9%-1.5pp

The Hybrid strategy's performance is essentially unchanged without NVDA --- a single-stock contribution of ~0.5pp out of 110.6%. This confirms the strategy's alpha is diversified across many stock selections, not driven by a single outlier.

NVDA exclusion Figure 13: Side-by-side performance with and without NVDA.


13. Statistical Significance

13.1 Probabilistic Sharpe Ratio (PSR)

The Probabilistic Sharpe Ratio (Bailey & Lopez de Prado, 2012) estimates the probability that the true (population) Sharpe ratio exceeds a benchmark value, accounting for finite sample size, skewness, and excess kurtosis:

PSR(SR^)=Φ(SR^dSRb,dSE)\text{PSR}(\hat{SR}) = \Phi\left(\frac{\hat{SR}_d - SR_{b,d}}{SE}\right)

where SR^d\hat{SR}_d is the observed daily Sharpe ratio, SRb,dSR_{b,d} is the benchmark daily Sharpe, Φ\Phi is the standard normal CDF, and the standard error is:

SE=1γ^3SR^d+γ^414SR^d2T1SE = \sqrt{\frac{1 - \hat{\gamma}_3 \cdot \hat{SR}_d + \frac{\hat{\gamma}_4 - 1}{4} \cdot \hat{SR}_d^2}{T - 1}}

with γ^3\hat{\gamma}_3 (skewness) and γ^4\hat{\gamma}_4 (excess kurtosis) estimated from the return series.

13.2 Deflated Sharpe Ratio (DSR)

Since we evaluated 36 strategy-parameter combinations (9 parameter sets x 4 strategies), multiple testing inflates the probability of finding spuriously high Sharpe ratios. The Deflated Sharpe Ratio (Bailey & Lopez de Prado, 2014) adjusts by replacing the benchmark with the expected maximum Sharpe ratio under the null hypothesis of NN independent strategies with zero true Sharpe:

E[max(SR)]SEd[(1γ)Φ1 ⁣(11N)+γΦ1 ⁣(11Ne)]252E[\max(SR)] \approx SE_d \cdot \left[(1 - \gamma) \Phi^{-1}\!\left(1 - \frac{1}{N}\right) + \gamma \, \Phi^{-1}\!\left(1 - \frac{1}{Ne}\right)\right] \cdot \sqrt{252}

where γ0.5772\gamma \approx 0.5772 is the Euler-Mascheroni constant, SEd=1/T1SE_d = 1/\sqrt{T-1} is the standard error of the daily Sharpe under the null, and the result is annualized by 252\sqrt{252}. The DSR is then computed as PSR with the benchmark inflated to max(SRbenchmark,E[max(SR)])\max(SR_{\text{benchmark}}, E[\max(SR)]).

13.3 Results

For the top-performing configurations (with SPY Sharpe = 1.45 as benchmark, N=36N = 36 trials):

Strategyδ\deltawmaxw_{\max}SharpePSR(>SPY)DSR(>SPY)
BL-LSTM Hybrid5.010%2.6297.8%97.8%
BL-LSTM Hybrid1.010%2.5797.4%97.4%
BL-LSTM Hybrid2.510%2.5497.0%97.0%
BL-LSTM Hybrid+DO5.010%2.4796.1%96.1%
BL-LSTM Hybrid+DO1.010%2.4696.0%96.0%
BL-LSTM Hybrid+DO2.510%2.4195.0%95.0%
BL-LSTM Full DO5.010%2.3193.3%93.3%
MV1.010%1.9881.9%81.9%
MV2.510%1.9178.5%78.5%

Table 4: Statistical significance tests. Bold DSR values exceed 95% (strong evidence of systematic edge).

6 out of 36 configurations pass the DSR > 95% threshold against SPY: all three Hybrid and all three Hybrid+DO variants at wmax=10%w_{\max} = 10\%. These results survive the multiple-testing correction, indicating that the Hybrid strategy's outperformance is unlikely attributable to data snooping across our 36-configuration grid.

The Full DO strategy at Sharpe 2.31 reaches DSR 93.3% --- probable but not conclusive at the 95% level. The MV baseline at Sharpe 1.91 achieves moderate evidence (DSR ~82--89%).

13.4 Sub-Period Consistency

A genuine edge should persist across different time periods. We split the backtest into three approximately 1-year sub-periods:

StrategyY1 (2023-03 to 2024-02)Y2 (2024-03 to 2025-02)Y3 (2025-03 to 2026-02)Wins vs SPY
BL-LSTM Hybrid (d=5)93.3%83.6%127.9%3/3
BL-LSTM Hybrid (d=2.5)107.2%80.2%144.5%3/3
BL-LSTM Hybrid+DO (d=5)85.6%74.3%118.5%3/3
BL-LSTM Full DO (d=5)82.7%74.6%124.1%3/3
MV (d=2.5)99.1%63.0%98.5%3/3
SPY30.8%15.8%20.5%---

Table 5: Sub-period annualized returns. All strategies beat SPY in every sub-period.

Every LSTM-based strategy outperforms SPY in all three years. The edge is not concentrated in a single favorable period.

13.5 Bootstrap Confidence Intervals

We compute 90% confidence intervals for the Sharpe ratio via 10,000 bootstrap resamples of daily returns:

StrategyObserved Sharpe5th pct95th pctEntire CI > SPY?
BL-LSTM Hybrid (d=5)2.621.673.60YES
BL-LSTM Hybrid (d=2.5)2.541.583.48YES
BL-LSTM Hybrid+DO (d=5)2.471.533.48YES
BL-LSTM Full DO (d=5)2.311.373.30no
MV (d=2.5)1.910.942.89no
SPY1.450.482.41---

Table 6: Bootstrap Sharpe ratio confidence intervals (10,000 resamples).

For the top three configurations (Hybrid and Hybrid+DO at 10% max weight), even the 5th percentile of the bootstrap distribution exceeds SPY's observed Sharpe ratio. This is a stringent test: even in the worst 5% of resampled scenarios, these strategies still outperform the benchmark.

Bootstrap distributions Figure 14: Bootstrap Sharpe ratio distributions. The BL-LSTM Hybrid distribution (red) is entirely to the right of SPY's observed Sharpe (dashed line).


14. Discussion

14.1 Why the Hybrid Works

The central finding is that selective deployment of LSTM predictions is far more valuable than comprehensive deployment. Across 533 stocks and 12 quarterly folds, the mean directional accuracy is 49.9% --- essentially random. However, a minority of models (~33% with accuracy above 55%) carry genuine predictive signal. The Hybrid strategy's confidence threshold acts as a quality filter, passing only predictions from this minority into the BL optimizer while allowing the MV baseline to provide diversified exposure elsewhere.

This mirrors a well-known principle in quantitative finance: a weak signal applied broadly often underperforms a strong signal applied selectively.

14.2 MC Dropout as a Dynamic Signal

MC Dropout provides a complementary dimension of quality assessment. While directional accuracy is static (fixed per model per fold), MC Dropout uncertainty varies per-prediction, capturing whether the model is confident about a specific stock on a specific date. The Full DO strategy's optimal Sharpe of 2.57 at θ=0.04\theta = 0.04 --- achieved without any historical accuracy filtering --- demonstrates that per-prediction uncertainty alone is a powerful discriminator.

In practice, the Hybrid strategy (which uses only the simpler accuracy filter) performs nearly as well. This suggests that both signals capture overlapping information: models with high accuracy tend to also produce low-uncertainty predictions.

14.3 Concentration, Diversification, and the Signal Dilution Problem

The parameter sweep over position limits (wmax{2%,5%,10%}w_{\max} \in \{2\%, 5\%, 10\%\}) reveals one of the most consequential tradeoffs in the system: concentration amplifies alpha but also amplifies risk, and at sufficiently low concentration levels, the LSTM-based strategies begin to converge toward --- and in some cases fall behind --- passive cap-weighted benchmarks.

Three Concentration Regimes

The position limit creates three distinct portfolio regimes:

wmaxw_{\max}Median PositionsPortfolio Character
10%12--14Concentrated, high-conviction
5%22--25Moderate diversification
2%50--56Broad, diversified

Figure 18 overlays all four strategies at each concentration tier against the passive benchmarks (SP20, SP50, SPY). The visual contrast is striking: at wmax=10%w_{\max} = 10\%, the active strategies dramatically separate from the benchmarks; at wmax=2%w_{\max} = 2\%, they cluster tightly around SP20 and SP50.

Concentration vs benchmarks Figure 18: Equity curves at each concentration tier (δ=2.5\delta = 2.5). At 10% max weight, active strategies dramatically outperform passive benchmarks. At 2%, they converge toward --- and MV nearly trails --- the cap-weighted SP50.

Portfolio Composition at Maximum Concentration

At wmax=10%w_{\max} = 10\%, the Hybrid strategy's top-5 holdings account for approximately 42% of portfolio weight, with the top-10 holdings comprising 84%. The effective number of positions (1/HHI) is approximately 12, meaning the portfolio's risk profile is dominated by roughly a dozen stocks. Examining the portfolio composition over time (Figure 4), these top holdings rotate meaningfully across quarterly fold boundaries as LSTM models are retrained --- names like APP, PLTR, CVNA, NVDA, and VST feature prominently but their relative sizing shifts, indicating the portfolio is responding to changing model signals rather than statically holding momentum winners.

Hybrid composition Figure 4: Top 15 holdings over time for the BL-LSTM Hybrid strategy at wmax=10%w_{\max} = 10\%. Holdings rotate meaningfully at fold boundaries, with the top-10 capturing ~84% of portfolio weight.

The mean-variance baseline is even more concentrated: its top-5 holdings represent 50% of weight and its top-10 capture 97%, with an effective N of only 10.5. This extreme concentration --- driven by MV's tendency to chase recent high-return stocks without the dampening effect of an equilibrium prior --- contributes to its higher drawdowns (35% vs 25% for Hybrid at the same wmaxw_{\max}).

MV composition Figure 15: Mean-variance portfolio composition. Without the equilibrium prior's moderating influence, MV concentrates even more aggressively into recent winners, driving higher volatility and deeper drawdowns.

Hybrid+DO composition Figure 16: BL-LSTM Hybrid+DO portfolio composition. The MC Dropout filter produces slightly different top holdings compared to the pure Hybrid --- CVNA and SMCI receive more weight when their predictions pass both the accuracy and uncertainty gates.

Full DO composition Figure 17: BL-LSTM Full DO portfolio composition. Without the accuracy gate, the model-selected stocks overlap substantially with the Hybrid but with modestly different weightings.

Diversification Erodes the LSTM Edge

As wmaxw_{\max} decreases from 10% to 2%, a striking pattern emerges: the Sharpe advantage of LSTM-based strategies over passive benchmarks narrows dramatically.

Strategywmaxw_{\max}SharpeSP50 SharpeAdvantage
BL-LSTM Hybrid (d=2.5)10%2.541.50+1.04
BL-LSTM Hybrid (d=2.5)5%2.301.50+0.80
BL-LSTM Hybrid (d=2.5)2%2.031.50+0.53

The Sharpe advantage over SP50 drops by nearly half as we move from concentrated to diversified portfolios.

At wmax=2%w_{\max} = 2\% with ~54 positions, the LSTM strategies hold a large number of MV fallback positions alongside their high-confidence LSTM selections. These fallback positions carry no LSTM alpha --- they are essentially passive, optimized-beta exposure. As their share of the portfolio grows, the overall portfolio converges toward a smarter version of mean-variance, diluting the concentrated LSTM signal that drove the top-line results.

Sub-Period Consistency Across Concentration Tiers

The concentration effect is not just a matter of final returns --- it fundamentally changes whether the strategies beat passive benchmarks consistently. We compare each wmaxw_{\max} tier against its concentration-matched benchmark across three yearly sub-periods:

wmax=10%w_{\max} = 10\% (~13 positions) vs SP20:

Strategy (δ=2.5\delta = 2.5)Y1Y2Y3Wins
SP20 (passive)76.2%24.1%35.9%---
BL-LSTM Hybrid107.2%80.2%144.5%3/3
BL-LSTM Hybrid+DO99.2%69.3%134.9%3/3
BL-LSTM Full DO88.6%72.2%120.0%3/3
MV99.1%63.0%98.5%3/3

At maximum concentration, all strategies beat SP20 in every sub-period by wide margins. Even the worst single-period result (MV at 63.0% in Y2) is 2.6x SP20's 24.1%.

wmax=5%w_{\max} = 5\% (~23 positions) vs SP20:

Strategy (δ=2.5\delta = 2.5)Y1Y2Y3Wins
SP20 (passive)76.2%24.1%35.9%---
BL-LSTM Hybrid81.9%54.7%86.6%3/3
BL-LSTM Hybrid+DO74.7%49.4%89.1%2/3
BL-LSTM Full DO65.2%60.5%93.8%2/3
MV72.8%39.5%63.9%2/3

Cracks begin to appear. Hybrid+DO, Full DO, and MV all trail SP20 in Y1 (2023--2024), the strongest year for large-cap concentration. SP20's 76.2% Y1 return reflects the mega-cap tech rally; at wmax=5%w_{\max} = 5\%, the strategies cannot concentrate enough into those winners. Only the pure Hybrid maintains a 3/3 record.

wmax=2%w_{\max} = 2\% (~54 positions) vs SP50:

Strategy (δ=2.5\delta = 2.5)Y1Y2Y3Wins
SP50 (passive)56.8%17.9%31.5%---
BL-LSTM Hybrid62.0%31.6%51.4%3/3
BL-LSTM Hybrid+DO58.0%29.4%51.2%3/3
BL-LSTM Full DO51.7%40.6%63.1%2/3
MV52.8%25.0%36.0%2/3

At 2% cap, both Full DO and MV trail SP50 in Y1. MV's Y3 performance (36.0% vs 31.5%) is barely positive. These are not catastrophic failures, but they reveal that forced diversification can turn the alpha source into noise.

Expanding the comparison to all delta values and all strategies, we find 21 out of 108 strategy-period combinations where a strategy trails SP20, and 7 out of 108 where a strategy trails even SP50. Every single SP50 underperformance occurs in Y1 at wmax=2%w_{\max} = 2\% --- the period when large-cap concentration was most rewarded.

Cumulative Performance: When SP20 Leads

The sub-period analysis understates the severity of the concentration penalty. Looking at cumulative wealth paths, the picture is more dramatic. Figure 19 plots each strategy's performance at wmax=2%w_{\max} = 2\% relative to SP20 --- values below zero indicate SP20 is ahead.

Relative to SP20 Figure 19: Cumulative performance of wmax=2%w_{\max} = 2\% strategies relative to SP20. Below zero means SP20 is winning. MV never surpasses SP20 during the entire backtest. LSTM strategies spend the majority of the backtest trailing SP20, only pulling ahead in the final months.

The results are sobering:

Strategy (δ=2.5\delta = 2.5, wmax=2%w_{\max} = 2\%)Days SP20 LeadsMax SP20 AdvantageFinal Outcome
BL-LSTM Hybrid96%+0.56Strategy wins (+0.19)
BL-LSTM Hybrid+DO99%+0.67Strategy wins (+0.07)
BL-LSTM Full DO87%+0.45Strategy wins (+0.66)
MV100%+0.81SP20 wins (+0.27)

MV at wmax=2%w_{\max} = 2\% is behind SP20 on literally 100% of trading days across all three delta values. The passive benchmark is ahead by as much as +0.81 (cumulative, on a $1 start) in late 2025. Even the LSTM Hybrid, which ultimately finishes ahead, trails SP20 for 96% of the backtest period. The Hybrid only overtakes SP20 in the final months of 2026 as late-year LSTM signals finally compound enough to overcome the early deficit.

This has profound implications for real-world deployment: an investor running the wmax=2%w_{\max} = 2\% Hybrid+DO would have spent nearly three years watching a simple cap-weighted basket of 20 stocks outperform them, with the strategy only pulling ahead by 7 cents on the dollar at the very end.

Rolling Win Rates

Figure 20 quantifies this temporal inconsistency through rolling 6-month win rates against concentration-matched benchmarks.

Win rates Figure 20: Rolling 6-month win rate versus concentration-matched passive benchmark (δ=2.5\delta = 2.5). At 10% max weight, all strategies beat SP20 in >80% of windows. At 2%, MV drops below a coin flip.

At wmax=10%w_{\max} = 10\%, all four strategies beat SP20 in 80--100% of rolling 6-month windows. At wmax=2%w_{\max} = 2\%, MV's win rate against SP50 drops to 43--57% --- essentially a coin flip. Even the LSTM strategies drop to 50--70% win rates against SP20 at wmax=2%w_{\max} = 2\%, depending on delta. The temporal consistency that makes the concentrated strategies compelling investments (you almost always beat the benchmark over any 6-month window) evaporates at high diversification levels.

Full Comparison Across Position-Limit Tiers

To summarize the tier-by-tier comparison against concentration-matched benchmarks:

wmax=10%w_{\max} = 10\% (~13 positions) vs SP20 (20 positions):

Strategy (d=2.5)ReturnSharpeMaxDDBeta
BL-LSTM Hybrid110.6%2.5425.3%1.60
SP20 benchmark43.7%1.6025.6%1.50

The Hybrid delivers 2.5x the return with comparable drawdown and similar beta. The alpha is overwhelming and consistent across all sub-periods.

wmax=5%w_{\max} = 5\% (~23 positions) vs SP20 (20 positions):

Strategy (d=2.5)ReturnSharpeMaxDDBeta
BL-LSTM Hybrid76.5%2.3023.6%1.43
MV Baseline60.7%1.7728.8%1.54
SP20 benchmark43.7%1.6025.6%1.50

The Hybrid still substantially outperforms. MV comfortably beats SP20 over the full period, though it trails in Y1.

wmax=2%w_{\max} = 2\% (~54 positions) vs SP50 (50 positions):

Strategy (d=2.5)ReturnSharpeMaxDDBeta
BL-LSTM Hybrid49.5%2.0320.5%1.22
MV Baseline39.2%1.6124.2%1.27
SP50 benchmark34.5%1.5024.0%1.37

Alpha persists but has narrowed substantially. MV at δ=5.0\delta = 5.0 returns only 37.0% --- barely above SP50's 34.5%. The Hybrid+DO at δ=5.0\delta = 5.0 returns only 44.8%, edging SP20's 43.7% by just 1.1 percentage points --- a margin that is likely not statistically distinguishable from zero.

Why Diversification Hurts: The Alpha Budget

The mechanism is straightforward. The LSTM's edge is concentrated in a small number of high-conviction predictions --- roughly the 30--35% of models that achieve above-random directional accuracy in any given fold. At wmax=10%w_{\max} = 10\%, the optimizer can allocate meaningful weight to these high-conviction selections, and the remaining portfolio slots (~3--4 low-confidence stocks) carry small enough weight to have limited drag. At wmax=2%w_{\max} = 2\%, the optimizer must spread across ~54 positions. Perhaps 15--20 of these carry LSTM alpha; the remaining 30--35 are MV fallback positions earning only the market premium. Each additional diversification position dilutes the portfolio-level alpha while adding minimal idiosyncratic risk reduction (since the high-confidence LSTM picks are already reasonably diversified across sectors and factors).

This is visible in the Sharpe-per-unit-of-beta ratio, which measures alpha efficiency independent of leverage:

ConfigurationSharpeBetaSharpe/Beta
BL-LSTM Hybrid, wmax=5%w_{\max} = 5\%, δ=5.0\delta = 5.02.381.311.82
BL-LSTM Hybrid, wmax=10%w_{\max} = 10\%, δ=5.0\delta = 5.02.621.461.79
BL-LSTM Full DO, wmax=2%w_{\max} = 2\%, δ=1.0\delta = 1.02.071.331.56
SP20 (passive)1.601.501.07
SP50 (passive)1.501.371.09
SPY1.451.001.45

The passive cap-weighted benchmarks (SP20, SP50) have Sharpe/Beta ratios near 1.0, indicating they offer no alpha beyond leveraged market exposure. All LSTM strategies maintain Sharpe/Beta substantially above 1.0 at every diversification level, confirming genuine alpha generation. However, this ratio drops from ~1.8 at 5--10% max weight to ~1.56 at 2% --- the signal dilution is quantifiable.

The Drawdown Tradeoff

The one clear advantage of diversification is risk reduction. Maximum drawdown falls monotonically as wmaxw_{\max} decreases:

wmaxw_{\max}Hybrid MaxDDMV MaxDDSP50 MaxDD
10%25.3%35.1%---
5%23.6%28.8%---
2%20.5%24.2%24.0%

At wmax=2%w_{\max} = 2\%, the Hybrid strategy's 20.5% maximum drawdown is lower than even SPY's 18.7% adjusted for the strategy's higher beta. For investors who prioritize capital preservation, the 2%-capped Hybrid (Sharpe 2.03, MaxDD 20.5%) may be preferable to the 10%-capped version (Sharpe 2.54, MaxDD 25.3%) despite the lower return.

Practical Implications

The concentration--diversification analysis reveals that the optimal position limit depends on the investor's objectives:

At no concentration level does the LSTM Hybrid strategy fail to beat its passive equivalent over the full period. But the margin narrows from overwhelming (+1.04 Sharpe vs SP50 at 10%) to modest (+0.53 at 2%), the temporal consistency degrades substantially, and the MV baseline nearly loses its edge over passive at the highest diversification levels. This underscores that the LSTM's contribution is primarily an intensive-margin effect --- improving the weight allocated to the best opportunities --- rather than an extensive-margin effect of finding alpha across the full universe.

The critical takeaway for practitioners: if position limits or risk mandates require wmax2%w_{\max} \leq 2\%, the LSTM-based strategies still offer modest improvement over passive alternatives, but the case for the added complexity of model training, MC Dropout inference, and weekly rebalancing becomes significantly weaker. The system's value proposition is strongest when it is allowed to concentrate capital in its highest-conviction predictions.

14.4 Limitations

Several caveats apply:

  1. Backtest period: Three years (2023--2026) includes a strong bull market. Performance in extended bear markets or regime changes remains untested.

  2. Transaction cost assumptions: Our 10bps assumption is optimistic for very large portfolios. For institutional-scale execution, market impact costs may be higher, though the NautilusTrader validation with 5bps confirms directional robustness.

  3. Survivorship bias: The S&P 500 universe is rebalanced periodically; our universe is fixed at the start. Stocks that were delisted or removed during the period are excluded if they had insufficient history.

  4. Look-ahead in universe selection: The 533-stock universe is selected based on data spanning the full period. In live deployment, universe selection would need to be done on a rolling basis.

  5. Scale of models trained: 6,396 models with ~50% average accuracy suggests the individual models have limited predictive power. The strategy's edge may be more attributable to the portfolio construction framework (selective BL + MV fallback) than to the LSTM predictions themselves.


15. Conclusion

We demonstrate that combining LSTM return predictions with Black-Litterman portfolio optimization --- specifically through a selective hybrid approach --- produces substantial out-of-sample alpha over a 3-year backtest on 533 S&P 500 stocks. The key innovation is not the prediction model itself (which achieves near-random accuracy on average) but the confidence-filtered integration: using LSTM views only where evidence supports predictive ability, and falling back to diversified mean-variance allocation elsewhere.

The BL-LSTM Hybrid strategy achieves a Sharpe ratio of 2.54 (vs. 1.45 for SPY) with a maximum drawdown of 25.3%. This outperformance survives the Deflated Sharpe Ratio correction for multiple testing (DSR = 97.0% against SPY at the default parameters), is consistent across all three yearly sub-periods, and maintains its bootstrap 90% confidence interval entirely above SPY's Sharpe.

Monte Carlo Dropout provides an effective per-prediction uncertainty filter, with the Full DO strategy achieving the highest single-configuration Sharpe (2.57) at θ=0.04\theta = 0.04, demonstrating that dynamic uncertainty estimation alone can effectively replace static accuracy metrics for view selection.

Extensive sensitivity analyses reveal that the strategy is robust to hyperparameter choice (all 36 parameter configurations outperform SPY) while showing consistent preference for concentrated portfolios (10% max weight), moderate-to-high risk aversion (δ=5.0\delta = 5.0), and weekly rebalancing aligned with the 5-day prediction horizon.


References


Appendix A: Figure Index

FigureFileDescription
1/article/lstm_quality.pngLSTM directional accuracy distribution
2/article/cumulative_all_views.pngCumulative returns, all strategies
3/article/metrics_bar_all_views.pngMetrics bar chart comparison
4/article/composition_bl_lstm_hybrid.pngHybrid portfolio composition over time
5/article/drawdown_all_views.pngDrawdown time series
6/article/rolling_sharpe_all_views.png63-day rolling Sharpe ratio
7/article/param_sweep_sharpe_heatmap.pngSharpe heatmaps (delta x max_weight)
8/article/param_sweep_positions_heatmap.pngPosition count heatmaps
9/article/param_sweep_equity_by_delta.pngEquity curves by risk aversion
10/article/param_sweep_equity_by_maxweight.pngEquity curves by position limit
11/article/rebalance_frequency_sharpe.pngRebalancing frequency impact
12/article/rebalance_heatmap.pngFull rebalancing sweep heatmap
13/article/nvda_exclusion_comparison.pngNVDA exclusion analysis
14/article/param_sweep_bootstrap_sharpe.pngBootstrap Sharpe distributions
15/article/composition_mv.pngMean-variance portfolio composition
16/article/composition_bl_lstm_hybrid_do.pngHybrid+DO portfolio composition
17/article/composition_bl_lstm_hybrid_full_do.pngFull DO portfolio composition
18/article/param_sweep_concentration_vs_benchmarks.pngStrategies vs passive benchmarks at each concentration tier
19/article/param_sweep_relative_to_sp20.pngCumulative performance relative to SP20 (mw=2%)
20/article/param_sweep_winrate_vs_benchmarks.pngRolling 6-month win rate vs benchmarks by concentration

Appendix B: Additional Figures

FileDescription
/article/accuracy_vs_weight.pngLSTM accuracy vs BL portfolio weight
/article/pred_vs_actual_NVDA.pngLSTM predictions vs realized (NVDA)
/article/bl_views_comparison.png8 BL view sources comparison
/article/cumulative_share_based.pngShare-based backtest equity
/article/share_vs_fixed_comparison.pngFixed-weight vs share-based comparison
/article/nautilus_all_equity.pngNautilusTrader equity curves
/article/nautilus_all_rolling_sharpe.pngNautilusTrader rolling Sharpe
/article/nautilus_all_drawdown.pngNautilusTrader drawdowns
/article/nautilus_all_position_counts.pngPosition counts over time
/article/nautilus_all_concentration.pngHerfindahl concentration index
/article/nautilus_all_1y_distributions.png1-year rolling return distributions