Numinor
3 min read · December 24, 2025

Text Data for Thematic Investing: Capturing Narrative-Driven Market Moves

Go beyond stock-specific sentiment: use NLP to identify emerging investment themes (AI infrastructure, green transition) from news and reports, then build thematic portfolios that adapt as narratives evolve.

Datasets Used
SmarTag News

The Question

Markets move in waves of thematic narratives: AI and automation, carbon neutrality, supply chain reshoring, domestic consumption upgrade. These themes don't fit neatly into traditional sector classifications—an "AI theme" includes chip designers, cloud providers, automation software, and data centers across multiple GICS sectors.

Investors who identify themes early and build baskets of exposed stocks capture the momentum as capital flows into the narrative. But manual thematic research is slow—by the time analysts publish thematic reports and ETFs launch, the easy money has been made.

Can you use NLP on financial text data to automatically detect emerging themes, measure theme intensity, and construct dynamic thematic portfolios that adapt as narratives gain or lose traction?

The Approach

We build a theme detection and tracking system using SmarTag News and analyst report archives:

Step 1: Theme Identification (Unsupervised NLP)
Apply topic modeling (Latent Dirichlet Allocation, BERTopic, or transformer-based clustering) to financial text corpora. This surfaces frequently co-occurring term clusters representing latent themes—without pre-defining what themes to look for.

For example, terms like "lithium-ion," "cathode," "CATL," "battery capacity," "energy density" cluster together, auto-identifying an "EV battery" theme. Terms like "large language model," "GPU," "inference," "training compute" cluster to identify an "AI infrastructure" theme.

Manual refinement is needed—label discovered clusters with human-readable theme names and validate that the term groupings make economic sense.
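
To make the clustering idea concrete, here is a minimal sketch. It is not LDA or BERTopic: it swaps in a much simpler stand-in that links terms appearing together in at least `min_cooccur` documents and then reads off the connected groups. The function name, the toy headlines, the vocabulary, and the threshold are all illustrative assumptions rather than part of the original system.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_clusters(docs, vocabulary, min_cooccur=2):
    """Group terms that co-occur in at least `min_cooccur` documents.

    A toy stand-in for topic modeling, not LDA or BERTopic.
    """
    pair_counts = defaultdict(int)
    for text in docs:
        lowered = text.lower()
        present = sorted(t for t in vocabulary if t in lowered)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1

    # Union-find: merge terms linked by a frequent enough co-occurrence.
    parent = {t: t for t in vocabulary}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for t in vocabulary:
        clusters[find(t)].add(t)
    # Singletons carry no theme information, so drop them.
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


# Hypothetical headlines, invented for illustration only.
docs = [
    "CATL raises battery capacity as cathode costs fall",
    "New cathode chemistry lifts energy density, CATL says",
    "Battery capacity and energy density gains at CATL",
    "GPU shortages slow large language model inference",
    "Training compute demand drives GPU orders for inference",
    "Large language model training compute keeps rising on GPU clusters",
]
vocabulary = [
    "catl", "cathode", "battery capacity", "energy density",
    "gpu", "inference", "large language model", "training compute",
]
themes = cooccurrence_clusters(docs, vocabulary)
```

On this toy corpus the terms split into two groups, one battery-related and one compute-related. Naming them "EV battery" and "AI infrastructure" is still the manual step described above.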

Step 2: Theme Intensity Tracking
For each theme, calculate a theme momentum index over time:

  • Volume metric: Count of articles mentioning theme keywords / total articles (measures attention share)
  • Sentiment metric: Average sentiment of theme-related articles (measures whether coverage is positive or negative)
  • Breadth metric: Number of distinct companies mentioned in theme-related articles (measures whether theme is concentrating or broadening)

Rising volume + positive sentiment + increasing breadth = theme gaining traction (buy signal)
Falling volume + negative sentiment + narrowing breadth = theme losing momentum (sell signal)
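
The three metrics and the two signal rules above can be sketched as follows. This assumes each article already carries a sentiment score in [-1, 1] and a list of tagged companies. The `Article` record, the keyword matching, and the "hold" fallback for mixed readings are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    month: str         # e.g. "2023-01"
    text: str
    sentiment: float   # assumed precomputed score in [-1, 1]
    companies: tuple   # companies tagged on the article


def theme_metrics(articles, keywords, month):
    """Volume share, mean sentiment, and company breadth for one month."""
    in_month = [a for a in articles if a.month == month]
    hits = [a for a in in_month
            if any(k in a.text.lower() for k in keywords)]
    volume = len(hits) / len(in_month) if in_month else 0.0
    sentiment = sum(a.sentiment for a in hits) / len(hits) if hits else 0.0
    breadth = len({c for a in hits for c in a.companies})
    return {"volume": volume, "sentiment": sentiment, "breadth": breadth}


def theme_signal(prev, curr):
    """Apply the two rules above; anything mixed is a hold (an assumption)."""
    rising = (curr["volume"] > prev["volume"]
              and curr["breadth"] > prev["breadth"])
    falling = (curr["volume"] < prev["volume"]
               and curr["breadth"] < prev["breadth"])
    if rising and curr["sentiment"] > 0:
        return "buy"
    if falling and curr["sentiment"] < 0:
        return "sell"
    return "hold"


# Hypothetical articles, invented for illustration only.
articles = [
    Article("2023-01", "GPU demand stays firm", 0.2, ("A",)),
    Article("2023-01", "Bank earnings preview", 0.1, ("X",)),
    Article("2023-01", "Retail sales update", 0.0, ("Y",)),
    Article("2023-01", "Property market note", -0.3, ("Z",)),
    Article("2023-02", "GPU orders surge at cloud firms", 0.4, ("A", "B")),
    Article("2023-02", "Large language model rollout widens", 0.6, ("C",)),
    Article("2023-02", "Bank earnings recap", 0.1, ("X",)),
    Article("2023-02", "Retail sales update", 0.0, ("Y",)),
]
keywords = ("gpu", "large language model")
jan = theme_metrics(articles, keywords, "2023-01")
feb = theme_metrics(articles, keywords, "2023-02")
signal = theme_signal(jan, feb)
```

Here attention share doubles from 0.25 to 0.5, breadth widens from one company to three, and sentiment is positive, so the sketch returns "buy". A production version would likely smooth these series over several months before comparing them.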

Step 3: Company-Theme Mapping
For each listed company, calculate a theme exposure score from two inputs: how frequently the company appears in theme-related articles, and how closely its products and business lines (from SAM) align with the theme.

Example: "AI infrastructure theme" → high exposure = GPU manufacturers (NVIDIA, AMD), cloud providers (Alibaba Cloud), data center REITs; low exposure = retail banks, consumer staples.
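
One way to blend the two inputs is a weighted average of a normalized mention count and a business-alignment share. The sketch below assumes exactly that. The 50/50 weighting, the function name, and the sample figures are illustrative assumptions, and `business_alignment` stands in for whatever a product taxonomy would actually supply.

```python
def exposure_scores(theme_mentions, business_alignment, text_weight=0.5):
    """Blend text-based and fundamental exposure into one score in [0, 1].

    theme_mentions: company -> count of theme articles mentioning it.
    business_alignment: company -> share of the business aligned with the
        theme, in [0, 1] (hypothetical input from a product taxonomy).
    """
    max_mentions = max(theme_mentions.values(), default=0)
    scores = {}
    for company in set(theme_mentions) | set(business_alignment):
        mentions = theme_mentions.get(company, 0)
        text_score = mentions / max_mentions if max_mentions else 0.0
        fundamental = business_alignment.get(company, 0.0)
        scores[company] = (text_weight * text_score
                           + (1 - text_weight) * fundamental)
    return scores


# Hypothetical inputs for an "AI infrastructure" theme.
mentions = {"GPU maker": 40, "Cloud provider": 20, "Retail bank": 10}
alignment = {"GPU maker": 0.9, "Cloud provider": 0.5, "Retail bank": 0.0}
scores = exposure_scores(mentions, alignment)
ranking = sorted(scores, key=scores.get, reverse=True)
```

The retail bank still gets a small score from mentions alone, which is why the fundamental leg matters: a company that is merely discussed alongside a theme should rank well below one that sells into it.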

Step 4: Dynamic Portfolio Construction
Each month, select the top 3-5 themes with strongest momentum (rising intensity metrics). Within each theme, overweight companies with highest exposure scores. Rebalance monthly as theme intensities shift—rotating from fading themes to emerging ones.
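
A minimal sketch of one monthly rebalance follows. It assumes a single momentum number per theme and per-theme exposure scores like those from Step 3. Giving each selected theme an equal budget and weighting companies in proportion to exposure is an assumption, since the text only says to overweight high-exposure names.

```python
def build_portfolio(theme_momentum, theme_exposures, top_n=3):
    """Pick the top_n themes by momentum and weight stocks by exposure."""
    top = sorted(theme_momentum, key=theme_momentum.get, reverse=True)[:top_n]
    weights = {}
    for theme in top:
        exposures = theme_exposures.get(theme, {})
        total = sum(exposures.values())
        if total <= 0:
            continue  # no usable exposure data for this theme
        for company, score in exposures.items():
            # Equal budget per theme, split in proportion to exposure.
            share = (1 / len(top)) * score / total
            weights[company] = weights.get(company, 0.0) + share
    # Renormalize in case a selected theme contributed nothing.
    grand_total = sum(weights.values())
    if grand_total:
        weights = {c: w / grand_total for c, w in weights.items()}
    return top, weights


# Hypothetical inputs, invented for illustration only.
momentum = {"AI infrastructure": 0.8, "EV battery": 0.5, "Metaverse": -0.2}
exposures = {
    "AI infrastructure": {"A": 0.9, "B": 0.3},
    "EV battery": {"C": 0.6, "B": 0.2},
    "Metaverse": {"D": 1.0},
}
top, weights = build_portfolio(momentum, exposures, top_n=2)
```

Company B appears in both selected themes, so its weight accumulates across them, while D drops out because its only theme missed the cut. Re-running this each month with fresh momentum values gives the rotation described above.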

The Finding

The thematic momentum system successfully identified major narrative waves and captured their alpha:

Identified Theme: "Domestic Substitution" (2022-2023)
NLP detected surging discussion of "localization," "import substitution," "supply chain security" in Chinese policy texts and analyst reports. The system auto-constructed a basket of companies with high domestic production share in critical industries (semiconductors, machinery, materials). This basket outperformed benchmarks by 18% over 12 months as policy support and investor attention flowed to the theme.

Identified Theme: "Carbon Neutrality" (2021-2022)
Volume and sentiment metrics spiked around "carbon peak," "renewable energy," "green hydrogen." The system built a broad basket spanning solar equipment, wind power, EV supply chain, carbon trading, and energy storage. During the theme's hot phase (Q1-Q3 2021), the basket delivered 40%+ returns; the system correctly rotated out when sentiment and volume metrics peaked and started declining in late 2021, avoiding the subsequent correction.

Identified Theme: "Aging Population" (2023-2024)
NLP surfaced an emerging theme around "elderly care," "pension," "healthcare services," "age-tech." The system identified exposed stocks (pharma, medical devices, insurance, senior living REITs) before dedicated thematic ETFs launched. Early positioning captured 12-15% alpha.

Quantitative performance:

  • Theme momentum index (top quintile themes vs bottom quintile): 22% annualized spread
  • Dynamic thematic portfolio (monthly rebalancing): 16% annualized excess return over a market-cap-weighted benchmark, with a Sharpe ratio of 1.1
  • False positives: ~30% of detected themes fizzled (volume rose briefly, then collapsed)—but the system's monthly rebalancing limited exposure to these failures

Why this works:

  1. Narrative drives flows: Institutional investors allocate thematically (ESG mandates, technology focus, policy alignment). When narratives gain traction, capital flows into exposed stocks regardless of fundamentals—creating sustained momentum.

  2. Text leads price: Theme intensity in text (articles, reports) rises before stock prices fully adjust. By the time theme-based ETFs launch or sell-side publishes theme reports, early-stage alpha has been captured.

  3. Dynamic adaptation: Static thematic baskets stagnate (think "blockchain" ETFs after 2017). The system's automatic theme rotation moves capital from dying narratives to emerging ones, sustaining alpha generation.

Try It Yourself

Building a thematic NLP system requires text data pipelines, NLP infrastructure, and portfolio construction frameworks—but the pieces are increasingly accessible.

Implementation steps:

  1. Text data sourcing: Secure access to financial news (SmarTag), analyst reports, policy documents, and social media (if you're brave). Minimum corpus: 10k+ articles/month for stable theme detection.

  2. NLP for theme detection: Use existing tools (BERTopic, pre-trained Sentence-BERT embeddings) rather than building from scratch. These can cluster text semantically without requiring labeled training data. Update theme models quarterly to capture evolving language.

  3. Validation: Manually review discovered themes each quarter. Some clusters are noise (random co-occurrences); others are genuine investment themes. Prune false themes and refine keyword lists.

  4. Exposure mapping: Combine text-based exposure (how often is company X mentioned in theme Y articles) with fundamental exposure (does company X actually produce products aligned with theme Y, per SAM taxonomy). Text alone can mislabel—a company mentioned in "AI" articles might just be using AI, not producing AI tech.

  5. Theme lifecycle management: Not all themes are created equal. Some are long-duration structural shifts (aging population, decarbonization); others are short-lived hype cycles (metaverse, certain crypto narratives). Weight structural themes higher, trade tactical themes faster.

  6. Risk controls: Thematic portfolios can be highly concentrated (overweight specific sectors, market caps, or geographies). Apply risk constraints to avoid extreme bets—sector caps, liquidity filters, max position sizes.
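
As an illustration of the risk-control step, the sketch below applies a maximum position size and redistributes the excess pro rata across the uncapped names. The function name and the 40% cap are assumptions. Sector caps and liquidity filters would need their own logic and are not shown.

```python
def cap_positions(weights, max_weight):
    """Cap each weight at max_weight, redistributing the excess pro rata."""
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and strictly positive")
    if max_weight * len(weights) < 1.0:
        raise ValueError("cap too tight: weights cannot sum to 1")
    capped, free, budget = {}, dict(weights), 1.0
    while free:
        total = sum(free.values())
        scaled = {name: budget * w / total for name, w in free.items()}
        over = [name for name, w in scaled.items() if w > max_weight]
        if not over:
            capped.update(scaled)
            break
        # Pin the offenders at the cap and re-split what is left.
        for name in over:
            capped[name] = max_weight
            budget -= max_weight
            del free[name]
    return capped


# Hypothetical concentrated portfolio.
raw = {"A": 0.60, "B": 0.25, "C": 0.10, "D": 0.05}
capped = cap_positions(raw, max_weight=0.40)
```

With a 40% cap, A is cut from 60% to 40% and the freed 20% is shared among B, C, and D in proportion to their original weights, so their relative ranking is preserved.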

Research extensions:

  • Sentiment evolution: Track not just volume but how themes are discussed. Transitioning from "potential" language to "adoption" language signals theme maturity.
  • Geographic themes: Apply the same framework to detect regional investment themes (ASEAN growth, inland China development) and construct geographic tilt portfolios.
  • Cross-asset themes: Some themes (inflation fears, credit tightening) span asset classes. Build multi-asset thematic strategies using the text-detected themes to allocate across equities, bonds, commodities.
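
For the sentiment evolution extension, one rough proxy is the share of "adoption" words among all stage-marking words in a theme's coverage. The sketch below assumes that approach. Both word lists and the sample sentences are hypothetical placeholders, and a real system would need a validated lexicon or a classifier.

```python
import re

# Hypothetical word lists; a real lexicon would need validation.
POTENTIAL_TERMS = {"potential", "could", "may", "expected", "pilot", "prototype"}
ADOPTION_TERMS = {"adopted", "deployed", "shipped", "revenue", "orders", "customers"}


def maturity_ratio(texts):
    """Share of adoption words among all stage-marking words, or None."""
    potential = adoption = 0
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            potential += token in POTENTIAL_TERMS
            adoption += token in ADOPTION_TERMS
    total = potential + adoption
    return adoption / total if total else None


# Hypothetical coverage at two points in a theme's life.
early = ["The technology could have potential; a pilot is expected."]
late = [
    "Systems deployed to customers are generating revenue and new orders.",
    "Growth may slow.",
]
early_ratio = maturity_ratio(early)
late_ratio = maturity_ratio(late)
```

A ratio climbing toward 1 over successive quarters would suggest coverage is shifting from speculation to reported adoption. The function returns None when no stage-marking words appear, so a quiet month is not mistaken for an immature theme.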

Thematic investing is inherently forward-looking—you're betting on where attention and capital will flow, not where fundamentals are today. NLP makes this systematic rather than speculative.

Ready to build an automated thematic investment system? Book a call to discuss NLP infrastructure, theme detection frameworks, and portfolio construction methodologies.

Source: CICC (中金公司), "Alternative Data Strategy (3): Text Information Supports Thematic Investing" (《另类数据策略(3):文本信息助力主题投资》), 2023-09-12.
