Numinor Co-Movement Graph Construct Data — Data Dictionary v1.0
SKU comovement-graph-v1 · Methodology Numinor Co-Movement Graph Whitepaper v1.3 ·
Canonical repo Numinor-Systems/comovement-codebase (MIT) · June 2026
1. What this is
A point-in-time, pairwise relationship feed over China A-shares that forecasts which names genuinely co-move — and flags which observed correlations are spurious. It is built in two layers:
- The substrate — the news co-movement graph. Every pair in the feed is one the market is co-mentioning in news over the trailing 90 days. This is the candidate universe: broad, timely, and by itself noisy.
- The overlay — the structural grading. Each co-mentioned pair is annotated with four structural "lamps" (deep product peer, SAM supply chain, disclosed customer–supplier, affiliate). A pair with ≥1 lamp is confirmed; a pair with none — co-mentioned but structurally unexplained — is dark (the blacklist).
The product is a risk/correlation layer, not a return signal and not a portfolio. It
ships as two tables: the edge feed (the graph) and a derived per-stock peer set (the
hedging / comparables view). Every number in the whitepaper is reproduced from these tables
by the reference codebase (verify_outputs.py → "✓ matches whitepaper").
This is not a structural-relationship map. A network-only feed would list every supply-chain or affiliate pair across ~6,000 names — most dormant, and silent on which observed correlations are spurious. By scoping to the news substrate, the feed keeps only the links the market is currently acting on, and gains the one field a static map cannot have: the blacklist.
2. Delivery and layout
Follows the Numinor Construct Data standard (one live store, all tiers; see
NUMINOR_CONSTRUCT_DATA_SPEC.md):
$COMOVEMENT_DATA_DIR/comovement_edges/T=YYYY-MM-DD.parquet (the edge feed — the graph)
$COMOVEMENT_DATA_DIR/comovement_peers/T=YYYY-MM-DD.parquet (the per-stock peer set)
$COMOVEMENT_DATA_DIR/model_coefficients.json (the forecast model)
Cadence: a daily point-in-time snapshot (the Ops-Desk daily cron, scripts/cron/run_cron.py).
The live feed publishes one partition per trading day to
s3://numinor-construct-data/comovement/parquet/year=/month=/day=/data.parquet (edges) +
comovement/peers/... (peer set); the trailing-90d co-mention window and trailing correlation roll
each day, so today's partition is the graph as-of today. The research vintage above
(comovement_edges/T=YYYY-MM-DD.parquet, one file per month-end) is what reproduces the whitepaper —
the daily feed's month-ends are exactly those vintages.
The PIT contract is the trade_date column. A partition holds the graph as it stood that day; to
read the graph usable at backtest time t, take the latest partition with trade_date <= t.
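The PIT read rule can be sketched as a small file-selection helper. This is a minimal illustration, assuming the date in each `T=YYYY-MM-DD.parquet` file name equals that partition's trade_date; the helper names `vintage_date` and `latest_vintage` are hypothetical and not part of the reference codebase.

```python
from datetime import date
from pathlib import Path
from typing import Iterable, Optional


def vintage_date(path: str) -> date:
    """Parse the as-of date from a file name like 'T=2026-04-30.parquet'."""
    stem = Path(path).stem  # e.g. 'T=2026-04-30'
    return date.fromisoformat(stem.split("=", 1)[1])


def latest_vintage(paths: Iterable[str], t: date) -> Optional[str]:
    """Return the newest vintage with as-of date <= t, or None if none exists."""
    eligible = [p for p in paths if vintage_date(p) <= t]
    return max(eligible, key=vintage_date) if eligible else None


# Hypothetical month-end vintages, for illustration only.
files = [
    "comovement_edges/T=2026-02-27.parquet",
    "comovement_edges/T=2026-03-31.parquet",
    "comovement_edges/T=2026-04-30.parquet",
]
picked = latest_vintage(files, date(2026, 4, 15))
print(picked)
```

A backtest dated mid-April therefore reads the end-of-March vintage and never the end-of-April one, which is the no-look-ahead behaviour the contract asks for.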
3. Table A — comovement_edges (the edge feed)
One row per (trade_date, ts_a, ts_b) news-co-mention A-share pair (~54k pairs/vintage over ~5,450 names). Columns are ordered substrate → overlay → output → scope.
| Layer | Column | Type | Meaning |
|---|---|---|---|
| key | trade_date | date | as-of month-end (the PIT contract) |
| key | ts_a, ts_b | string | the A-share pair, canonical order ts_a < ts_b (NNNNNN.SH/.SZ/.BJ) |
| substrate | comention_strength | int32 | news co-mention weight over the trailing window (how heavily the two are linked in the news) |
| substrate | comention_days | int32 | distinct days the pair was co-mentioned in the window |
| substrate | trailing_corr | double | realized 90-day return co-movement (the normalized measure of §3.2 of the whitepaper) |
| overlay | lamp_product_peer | bool | the two firms share a product (any depth on the SAM tree) — the shallow-peer base |
| overlay | peer_depth | int8 | deepest shared level on the SAM product tree (≥4 = same specific product) |
| overlay | lamp_deep_peer | bool | peer_depth >= 4 — the standout forecasting lamp |
| overlay | lamp_sam_chain | bool | SAM supply-chain link (core inputs), product-derived and continuously available |
| overlay | lamp_disclosed | bool | financial-statement disclosed customer–supplier, active within 2 years, bids excluded |
| overlay | lamp_affiliate | bool | common ownership / cross-holding |
| output | n_lamps | int8 | how many of the four lamps are lit: lamp_deep_peer, lamp_sam_chain, lamp_disclosed, lamp_affiliate (the relatedness gate) |
| output | confirmed | bool | n_lamps >= 1 — structurally explained co-movement |
| output | dark | bool | n_lamps == 0 — co-mentioned, structurally unconfirmed (the discount flag) |
| output | tier | string | deep_peer / multi / single / dark (the grading ladder) |
| output | expected_fwd_corr | double | forward-correlation forecast = const + trailing·c + dark·c + Σ lamp·c (develop-fit coefficients) |
| output | corr_delta | double | expected_fwd_corr − trailing_corr — where the graph disagrees with the price screen |
| scope | cap_tier | string | CSI300 / CSI500 / CSI1000 / other — pair is in a tier iff both names are members (the hedging edge is large-cap; §8) |
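
The output columns are simple functions of the overlay columns, so a consumer can sanity-check a partition on load. The sketch below recomputes n_lamps, confirmed, dark and corr_delta on toy rows. It assumes the four counted lamps are the ones named in section 1 (lamp_product_peer is treated as the shallow base and not counted); the function name `recompute_outputs` and all values in the toy frame are illustrative, not taken from the feed.

```python
import pandas as pd

# Assumption: these four columns are the graded lamps; lamp_product_peer is not counted.
LAMPS = ["lamp_deep_peer", "lamp_sam_chain", "lamp_disclosed", "lamp_affiliate"]


def recompute_outputs(edges: pd.DataFrame) -> pd.DataFrame:
    """Recompute the derived output columns from the overlay columns."""
    out = pd.DataFrame(index=edges.index)
    out["n_lamps"] = edges[LAMPS].sum(axis=1).astype("int8")
    out["confirmed"] = out["n_lamps"] >= 1   # structurally explained
    out["dark"] = out["n_lamps"] == 0        # co-mentioned, no lamp lit
    out["corr_delta"] = edges["expected_fwd_corr"] - edges["trailing_corr"]
    return out


# Toy rows with made-up values, in canonical order ts_a < ts_b.
toy = pd.DataFrame({
    "ts_a": ["000001.SZ", "000002.SZ", "600000.SH"],
    "ts_b": ["600036.SH", "600048.SH", "600519.SH"],
    "lamp_deep_peer": [True, False, False],
    "lamp_sam_chain": [False, False, False],
    "lamp_disclosed": [False, False, False],
    "lamp_affiliate": [True, True, False],
    "trailing_corr": [0.50, 0.25, 0.75],
    "expected_fwd_corr": [0.50, 0.125, 0.25],
})
chk = recompute_outputs(toy)
print(chk)
```

Comparing `chk` against the shipped columns is a cheap integrity check; a large negative corr_delta on a dark row is where the graph most strongly disagrees with the price screen.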
4. Table B — comovement_peers (the per-stock peer set)
Derived from Table A: each name's top-10 structural peers, the raw material for a hedge basket or a comparables set. One row per (trade_date, ts_code, peer_rank).
| Column | Type | Meaning |
|---|---|---|
| trade_date | date | as-of month-end |
| ts_code | string | the focal A-share name |
| peer_rank | int | 1 = closest peer (rank by relatedness, then co-mention strength) |
| peer_ts | string | the peer name |
| peer_score | int | relatedness score = 2·deep_peer + 2·disclosed + n_lamps |
| tier | string | the pair's grading tier |
| trailing_corr | double | the pair's trailing correlation |
| expected_fwd_corr | double | the pair's forward-correlation forecast |
| confirmed | bool | structurally confirmed (always true for ranked peers) |
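
The derivation of Table B from Table A can be sketched as follows. This is a rough illustration rather than the reference implementation: it keeps confirmed edges only, scores them with the peer_score formula above, expands each undirected edge into two directed rows, and ranks by score and then comention_strength. The function name `build_peers`, the tie-breaking beyond those two keys, and the toy values are assumptions.

```python
import pandas as pd


def build_peers(edges: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Rank each name's confirmed peers by relatedness, then co-mention strength."""
    e = edges[edges["confirmed"]].copy()
    e["peer_score"] = (2 * e["lamp_deep_peer"].astype(int)
                       + 2 * e["lamp_disclosed"].astype(int)
                       + e["n_lamps"].astype(int))
    # Each undirected edge (ts_a, ts_b) gives one row per focal name.
    fwd = e.rename(columns={"ts_a": "ts_code", "ts_b": "peer_ts"})
    rev = e.rename(columns={"ts_b": "ts_code", "ts_a": "peer_ts"})
    both = pd.concat([fwd, rev], ignore_index=True)
    both = both.sort_values(
        ["trade_date", "ts_code", "peer_score", "comention_strength"],
        ascending=[True, True, False, False], kind="mergesort")
    both["peer_rank"] = both.groupby(["trade_date", "ts_code"]).cumcount() + 1
    cols = ["trade_date", "ts_code", "peer_rank", "peer_ts", "peer_score"]
    return both.loc[both["peer_rank"] <= top_n, cols].reset_index(drop=True)


# Toy edges with made-up values; the last row is dark and must not be ranked.
d = pd.Timestamp("2026-04-30")
toy_edges = pd.DataFrame({
    "trade_date": [d, d, d, d],
    "ts_a": ["000001.SZ", "000001.SZ", "000002.SZ", "000001.SZ"],
    "ts_b": ["000002.SZ", "600000.SH", "600000.SH", "600519.SH"],
    "comention_strength": [5, 9, 3, 7],
    "lamp_deep_peer": [True, False, False, False],
    "lamp_disclosed": [False, False, True, False],
    "n_lamps": [1, 1, 2, 0],
    "confirmed": [True, True, True, False],
})
toy_peers = build_peers(toy_edges)
print(toy_peers)
```

Filtering to confirmed edges before ranking is what makes the confirmed column always true in the peer set.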
5. Point-in-time rule
Every layer is constructed as-of trade_date with no look-ahead:
- Substrate: co-mention is measured over the trailing 90 days ending trade_date; trailing_corr uses returns up to and including trade_date.
- Lamps: product versions and ownership enter as-of trade_date; the disclosed lamp uses only financial-statement relationships with an availability date in (trade_date − 2yr, trade_date], rebuilt point-in-time (not accumulated), so a relationship that has gone quiet drops out. Tender/bid awards are excluded because they do not forecast co-movement out-of-sample (whitepaper §8).
- expected_fwd_corr uses coefficients fitted on the develop window only (2017–mid-2022), so a holdout pair's forecast never sees holdout data.
No forward field is ever shipped: forward correlation exists only inside the replication harness, where it is computed at analysis time to validate the feed.
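The half-open availability window for the disclosed lamp can be illustrated with a short filter. This is a sketch under stated assumptions: the relationship table and its column names (customer, supplier, avail_date) are hypothetical, as is the function name `disclosed_pairs_asof`; only the window (trade_date − 2yr, trade_date] and the rebuild-not-accumulate behaviour come from the rule above.

```python
import pandas as pd


def disclosed_pairs_asof(rels: pd.DataFrame, trade_date: pd.Timestamp) -> set:
    """Pairs whose disclosure became available in (trade_date - 2yr, trade_date]."""
    lo = trade_date - pd.DateOffset(years=2)
    live = rels[(rels["avail_date"] > lo) & (rels["avail_date"] <= trade_date)]
    # Store pairs in canonical order (ts_a < ts_b) to match the edge feed.
    return {tuple(sorted(p)) for p in zip(live["customer"], live["supplier"])}


# Hypothetical disclosure records, for illustration only.
rels = pd.DataFrame({
    "customer": ["000001.SZ", "000002.SZ", "600000.SH", "000001.SZ"],
    "supplier": ["600519.SH", "000001.SZ", "600519.SH", "600000.SH"],
    "avail_date": pd.to_datetime(
        ["2024-04-30", "2024-05-06", "2026-04-30", "2026-05-15"]),
})
lit = disclosed_pairs_asof(rels, pd.Timestamp("2026-04-30"))
print(sorted(lit))
```

Because the set is rebuilt from scratch for each trade_date, the first record (exactly two years old) has aged out and the last (not yet available) is invisible.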
6. Construction notes a buyer should know
- Universe character. The feed concentrates on well-covered names — small caps are not co-mentioned heavily enough to enter the graph with signal. This is why the hedging edge is large-cap (whitepaper §8): the architecture and the scope boundary come from the same fact.
- confirmed/dark is the hero field. It exists only because the news substrate gives an observed co-movement to certify. Confirmed correlations retain ~92% of their level a quarter forward vs ~75% for price-screened pairs.
- expected_fwd_corr is a structural forecast, not a blacklist penalty. For a dark pair it falls back to the no-structure baseline; the blacklist lives in the dark flag and the retention evidence, not in a negative forecast.
- Risk, not alpha. The graph forecasts how names move together, not which way. A return-spillover signal on the same graph is flat-to-negative out-of-sample (whitepaper §8).
- Not a global risk model. Adding the graph to a whole-universe minimum-variance optimizer does not reduce realized variance; the value is targeted (single-name hedging, pairs, concentrated-position risk).
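To make the "structural forecast, not a blacklist penalty" point concrete, the linear form given for expected_fwd_corr in Table A (const + trailing·c + dark·c + Σ lamp·c) can be evaluated by hand. Everything numeric below is a placeholder: the coefficient values, the key names, and the choice of which lamp columns enter the sum are assumptions for illustration, and the actual layout of model_coefficients.json is not described here.

```python
import math

# PLACEHOLDER coefficients for illustration only; NOT the shipped develop-fit values.
COEF = {
    "const": 0.10,
    "trailing_corr": 0.50,
    "dark": 0.0,
    "lamp_deep_peer": 0.10,
    "lamp_sam_chain": 0.05,
    "lamp_disclosed": 0.05,
    "lamp_affiliate": 0.05,
}
# Assumption: these four lamp columns are the ones in the sum.
LAMP_KEYS = ("lamp_deep_peer", "lamp_sam_chain", "lamp_disclosed", "lamp_affiliate")


def expected_fwd_corr(row: dict, coef: dict = COEF) -> float:
    """const + c_trailing * trailing_corr + c_dark * dark + sum of lit-lamp coefficients."""
    value = coef["const"] + coef["trailing_corr"] * row["trailing_corr"]
    value += coef["dark"] * float(row["dark"])
    value += sum(coef[k] * float(row.get(k, False)) for k in LAMP_KEYS)
    return value


dark_pair = {"trailing_corr": 0.4, "dark": True}
deep_pair = {"trailing_corr": 0.4, "dark": False, "lamp_deep_peer": True}
f_dark = expected_fwd_corr(dark_pair)
f_deep = expected_fwd_corr(deep_pair)
print(f_dark, f_deep)
```

With no lamp lit, the lamp terms vanish and the forecast reduces to the baseline terms, which is the fallback behaviour described for dark pairs.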
7. Quickstart
import os
import pandas as pd
import pyarrow.dataset as pads

DATA = os.environ["COMOVEMENT_DATA_DIR"]  # root of the data store (see §2)
edges = pads.dataset(f"{DATA}/comovement_edges", format="parquet",
                     partitioning="hive").to_table().to_pandas()
# Parquet date columns load as datetime.date objects; convert so they compare with Timestamps.
edges["trade_date"] = pd.to_datetime(edges["trade_date"])

# the graph usable at time t (latest vintage on/before t):
t = pd.Timestamp("2026-04-30")
g = edges[edges.trade_date == edges.loc[edges.trade_date <= t, "trade_date"].max()]
confirmed = g[g.confirmed]  # trust these correlations
blacklist = g[g.dark]       # discount these: co-moving for no structural reason

# assumes a peer-set vintage exists for exactly t
peers = pd.read_parquet(f"{DATA}/comovement_peers/T={t:%Y-%m-%d}.parquet")
hedge_basket = peers[peers.ts_code == "600519.SH"].peer_ts.tolist()  # top-10 graph hedge
To reproduce the whitepaper, run scripts/verify_outputs.py (reference codebase,
Numinor-Systems/comovement-codebase) → "✓ matches whitepaper".
8. Versioning
| Component | Version |
|---|---|
| Feed schema | 1.0 |
| Methodology | Whitepaper v1.3 |
| Forecast model | develop-fit, shipped as model_coefficients.json |
| Codebase / reference impl | comovement-codebase ≥ 1.0.0 (MIT) |
Schema changes bump the schema version and are announced ahead of effect.
Numinor Systems Limited · support@numinor.io