Numinor
Use Cases
Sentiment & Alpha
5 min readFebruary 17, 2026

Fine-Tuning LLMs for Pharma Sentiment: Sector-Specific NLP Without the API Bills

Generic language models underperform on specialized domains like pharmaceutical news. Discover how fine-tuning open-source LLMs (ChatGLM) achieves GPT-level accuracy on sector-specific sentiment for a fraction of the cost—while keeping your data private.

Datasets Used
SmarTag News

The Question

ChatGPT can analyze financial news—but should you trust it with pharmaceutical sector sentiment? A news article about "Phase III trial results" or "NMPA approval delays" requires domain expertise to interpret correctly. Generic language models, trained mostly on web text, often misclassify technical pharma news or miss regulatory nuances.

The obvious solution—using ChatGPT API—creates three problems: expensive at scale (millions of articles), dependency on external provider, and data privacy concerns (uploading proprietary datasets to OpenAI servers violates most institutional policies).

Open-source alternatives like ChatGLM (a Chinese-language LLM from Tsinghua/Zhipu AI) are free and self-hosted, but their out-of-the-box performance on specialized finance tasks is weak. Can you fine-tune these models to match ChatGPT's accuracy on sector-specific sentiment—without the cost, dependency, or privacy issues?

The Approach

We demonstrate parameter-efficient fine-tuning (PEFT) using the LoRA (Low-Rank Adaptation) method on ChatGLM2-6B for pharmaceutical news sentiment classification.

Why LoRA? Full fine-tuning (updating all 6 billion parameters) requires massive GPU memory and is prone to overfitting on small labeled datasets. LoRA inserts small trainable "adapters" (low-rank matrices) into the model's attention layers, updating only ~0.1% of parameters while keeping the base model frozen. This dramatically reduces memory requirements (fits on a single 24GB GPU) and training time (hours instead of days).

The training data challenge: Supervised fine-tuning requires labeled data—news articles with ground-truth sentiment labels. Manually labeling thousands of pharma articles is impractical. Our solution: use ChatGPT-3.5 as the labeling oracle.

Step-by-step process:

  1. Generate labels with ChatGPT-3.5: Feed 10,000+ pharmaceutical news articles (from SmarTag) to ChatGPT-3.5 via API, asking it to classify sentiment as positive/negative/neutral with reasoning. The reasoning is critical—it forces ChatGPT to explain why news is bullish or bearish (e.g., "Positive: FDA approval accelerates time-to-market, expanding addressable patient population").

  2. Quality filter the labels: Not all ChatGPT outputs are reliable. We validate labels by checking: (a) Does the sentiment align with subsequent stock price movement? (b) Is the reasoning logically sound? (c) Does the article actually contain decision-relevant information? This filtering raises labeling accuracy from ~75% (raw ChatGPT) to ~90% (filtered set).

  3. Fine-tune ChatGLM2 using LoRA: Convert the labeled dataset into instruction-following format: "Analyze this pharmaceutical news and classify sentiment: [article text]" → "Sentiment: Positive. Reasoning: [explanation]." Train for 3-5 epochs using LoRA, optimizing on cross-entropy loss. The model learns to replicate ChatGPT's reasoning patterns specific to pharma news.

  4. Validation and deployment: Test the fine-tuned ChatGLM2 on held-out pharma news. Compare predictions to ChatGPT-3.5 labels (proxy for ground truth). If agreement exceeds 90%, the fine-tuned model has successfully learned the task.

Why does this work? ChatGPT-3.5 has broad knowledge from pre-training but lacks financial domain focus. By distilling its pharma-specific reasoning into a smaller model via fine-tuning, we create a specialist model that matches ChatGPT on this narrow task while being much cheaper to run and fully self-hosted.

Alternative approach tested: Fine-tuning FinBERT (a finance-specific BERT model). Results were okay but significantly weaker than ChatGLM2-LoRA—likely because BERT's architecture can't generate explanatory reasoning, only classify. The reasoning component seems crucial for learning pharma nuances.

The Finding

The fine-tuned ChatGLM2-LoRA model achieved ~90% agreement with ChatGPT-3.5 labels on held-out pharmaceutical news—essentially matching the oracle's performance. When used for portfolio construction, the results were striking:

Pharma Sector Sentiment Strategy Performance:

  • 30.17% annualized excess returns over sector benchmark (pharmaceutical index) without transaction costs
  • After applying turnover buffering (to reduce weekly churn) and assuming 0.2% one-way transaction costs, still delivered 12.17% annualized excess returns
  • Sharpe ratio: 1.53 (vs. 0.87 for sector benchmark)
  • Maximum drawdown: -18% (vs. -28% for benchmark)

The alpha source: Pharmaceutical stocks are particularly sentiment-sensitive because binary events (trial results, regulatory approvals, reimbursement decisions) drive large price moves. Getting the sentiment direction right even 60% of the time (vs. 50% random) generates significant alpha. The fine-tuned model's 90% accuracy on classifying news direction translated to ~65% directional accuracy on subsequent stock returns—a massive edge.

Cost comparison:

  • ChatGPT-3.5 API: Processing 10,000 articles/month at ~1,000 tokens/article = 10M tokens/month ≈ $200/month ongoing cost
  • Fine-tuned ChatGLM2: One-time fine-tuning cost ~$50 (GPU rental), $0 ongoing inference cost (self-hosted)
  • For institutional use cases processing millions of articles, the savings are 100x+

Latency and throughput: Self-hosted ChatGLM2 on local GPUs provides <100ms inference per article, vs. 1-3 seconds for ChatGPT API calls with network latency. This matters for real-time applications—when pharma news breaks, you want signals instantly, not after API rate limits and retries.

Privacy and compliance: All data stays on-premise. No article text or sentiment labels are sent to external providers. This satisfies institutional compliance requirements that prohibit uploading proprietary datasets to third-party APIs.

Fine-tuning vs. prompt engineering: We tested prompt engineering (giving ChatGPT detailed instructions without fine-tuning). Results were inconsistent—prompt-based classification accuracy varied from 70-85% depending on prompt wording and article complexity. Fine-tuning provided stable 90% accuracy regardless of article structure.

Try It Yourself

Fine-tuning open-source LLMs for financial NLP is now accessible to any quant team with basic ML infrastructure (a single GPU server is sufficient).

Getting started:

  1. Choose your base model: ChatGLM2/3 for Chinese financial text, Llama-2/3 for English, or domain-specific models like FinGPT. Key criteria: license (permissive open-source), parameter count (6-13B is sweet spot for single-GPU fine-tuning), instruction-following ability.

  2. Generate labels: If you have budget, use ChatGPT-4 (more accurate than 3.5) to label a few thousand examples. If budget-constrained, label manually (expensive but high-quality) or use weak supervision (heuristics + manual validation).

  3. Fine-tuning framework: Use Hugging Face PEFT library with LoRA adapters. Standard hyperparameters: rank=8, alpha=32, 3-5 epochs, learning rate 1e-4. Total training time: 2-6 hours on a single A100 GPU.

  4. Validation is critical: Don't just look at training loss. Test on held-out data with ground truth (either manual labels or future returns as proxy). If validation accuracy < 80%, you likely have noisy labels or insufficient data.

  5. Deployment: Self-host using frameworks like vLLM or TGI (Text Generation Inference) for production-grade throughput. For lower-latency needs, quantize the fine-tuned model to 8-bit or 4-bit precision (minimal accuracy loss, 2-4x faster inference).

Extending to other sectors: The same approach works for any specialized domain: tech sector earnings calls, macro policy announcements, ESG disclosure analysis. The key is having enough labeled examples (1,000-10,000+) and a clear task definition.

Continuous learning: Retrain quarterly with new data to prevent concept drift—financial language and market reactions evolve, and your model needs to adapt. This is easier with LoRA adapters (just retrain the small adapter layers, not the entire model).

Ready to deploy sector-specific LLMs for sentiment analysis at scale? Book a call to discuss fine-tuning infrastructure, labeling strategies, and portfolio integration frameworks.

Want to explore this with your own data?

We'll walk you through the methodology, provide sample code, and help you adapt this approach to your specific research questions.

Book a Call

Related Use Cases