Numinor Co-Movement Graph Construct Data — Data Dictionary v1.0

SKU comovement-graph-v1 · Methodology Numinor Co-Movement Graph Whitepaper v1.3 · Canonical repo Numinor-Systems/comovement-codebase (MIT) · June 2026

1. What this is

A point-in-time, pairwise relationship feed over China A-shares that forecasts which names genuinely co-move — and flags which observed correlations are spurious. It is built in two layers:

The substrate — the news co-movement graph. Every pair in the feed is one the market is co-mentioning in news over the trailing 90 days. This is the candidate universe: broad, timely, and by itself noisy.
The overlay — the structural grading. Each co-mentioned pair is annotated with four structural "lamps" (deep product peer, SAM supply chain, disclosed customer–supplier, affiliate). A pair with ≥1 lamp is confirmed; a pair with none — co-mentioned but structurally unexplained — is dark (the blacklist).

The product is a risk/correlation layer, not a return signal and not a portfolio. It ships as two tables: the edge feed (the graph) and a derived per-stock peer set (the hedging / comparables view). Every number in the whitepaper is reproduced from these tables by the reference codebase (scripts/verify_outputs.py → "✓ matches whitepaper").

This is not a structural-relationship map. A network-only feed would list every supply-chain or affiliate pair across ~6,000 names — most dormant, and silent on which observed correlations are spurious. By scoping to the news substrate, the feed keeps only the links the market is currently acting on, and gains the one field a static map cannot have: the blacklist.

2. Delivery and layout

Follows the Numinor Construct Data standard (one live store, all tiers; see NUMINOR_CONSTRUCT_DATA_SPEC.md):

$COMOVEMENT_DATA_DIR/comovement_edges/T=YYYY-MM-DD.parquet   (the edge feed — the graph)
$COMOVEMENT_DATA_DIR/comovement_peers/T=YYYY-MM-DD.parquet   (the per-stock peer set)
$COMOVEMENT_DATA_DIR/model_coefficients.json                 (the forecast model)

Cadence: a daily point-in-time snapshot, produced by the Ops-Desk daily cron (scripts/cron/run_cron.py). The live feed publishes one partition per trading day to s3://numinor-construct-data/comovement/parquet/year=/month=/day=/data.parquet (edges) and comovement/peers/... (peer set). The trailing-90d co-mention window and the trailing correlation both roll forward each day, so today's partition is the graph as of today. The research vintages above (comovement_edges/T=YYYY-MM-DD.parquet, one file per month-end) are what reproduce the whitepaper; the daily feed's month-end partitions are exactly those vintages.

The PIT contract is the trade_date column. A partition holds the graph as it stood that day; to read the graph usable at backtest time t, take the latest partition with trade_date <= t.
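
The "latest partition on or before t" rule can be sketched against the research-vintage layout as follows. This is a minimal illustration, not part of the reference codebase: the helper name `latest_vintage` is hypothetical, and it assumes the date in each `T=YYYY-MM-DD.parquet` file name equals the `trade_date` stored inside the file.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def latest_vintage(table_dir: Path, t: date) -> Optional[Path]:
    """Return the newest T=YYYY-MM-DD.parquet file dated on or before t."""
    best = None  # (as_of_date, path) of the best candidate seen so far
    for path in table_dir.glob("T=*.parquet"):
        try:
            as_of = date.fromisoformat(path.stem[len("T="):])
        except ValueError:
            continue  # ignore files that do not follow the naming convention
        if as_of <= t and (best is None or as_of > best[0]):
            best = (as_of, path)
    return None if best is None else best[1]
```

Returning `None` when no vintage qualifies, rather than falling forward to the earliest file, keeps a backtest from silently reading a graph that did not yet exist at t.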

3. Table A — `comovement_edges` (the edge feed)

One row per (trade_date, ts_a, ts_b) news-co-mention A-share pair (~54k pairs/vintage over ~5,450 names). Columns are ordered substrate → overlay → output → scope.

Layer	Column	Type	Meaning
key	`trade_date`	date	as-of date of the snapshot: month-end in the research vintages, each trading day in the live feed (the PIT contract)
key	`ts_a`, `ts_b`	string	the A-share pair, canonical order `ts_a < ts_b` (`NNNNNN.SH/.SZ/.BJ`)
substrate	`comention_strength`	int32	news co-mention weight over the trailing window (how heavily the two are linked in the news)
substrate	`comention_days`	int32	distinct days the pair was co-mentioned in the window
substrate	`trailing_corr`	double	realized 90-day return co-movement (the normalized measure of §3.2 of the whitepaper)
overlay	`lamp_product_peer`	bool	the two firms share a product (any depth on the SAM tree) — the shallow-peer base
overlay	`peer_depth`	int8	deepest shared level on the SAM product tree (≥4 = same specific product)
overlay	`lamp_deep_peer`	bool	`peer_depth >= 4` — the standout forecasting lamp
overlay	`lamp_sam_chain`	bool	SAM supply-chain link (core inputs), product-derived and continuously available
overlay	`lamp_disclosed`	bool	financial-statement disclosed customer–supplier, active within 2 years, bids excluded
overlay	`lamp_affiliate`	bool	common ownership / cross-holding
output	`n_lamps`	int8	how many of the four lamps of §1 (`lamp_deep_peer`, `lamp_sam_chain`, `lamp_disclosed`, `lamp_affiliate`) are lit (the relatedness gate)
output	`confirmed`	bool	`n_lamps >= 1` — structurally explained co-movement
output	`dark`	bool	`n_lamps == 0` — co-mentioned, structurally unconfirmed (the discount flag)
output	`tier`	string	`deep_peer` / `multi` / `single` / `dark` (the grading ladder)
output	`expected_fwd_corr`	double	forward-correlation forecast = `const + c_trailing·trailing_corr + c_dark·dark + Σ c_lamp·lamp` (develop-fit coefficients, one per term)
output	`corr_delta`	double	`expected_fwd_corr − trailing_corr` — where the graph disagrees with the price screen
scope	`cap_tier`	string	`CSI300` / `CSI500` / `CSI1000` / `other` — pair is in a tier iff both names are members (the hedging edge is large-cap; §8)
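
The output columns are deterministic functions of the overlay columns, so a buyer can sanity-check a row on receipt. The sketch below is illustrative only: the function names are hypothetical, a row is represented as a plain dict keyed by column name, and it assumes that `n_lamps` counts the four lamps named in §1 (so `lamp_product_peer` is not counted) and that a missing `peer_depth` means no shared product.

```python
import math

# Assumption: these are the four lamps that n_lamps counts (see section 1).
LAMPS = ("lamp_deep_peer", "lamp_sam_chain", "lamp_disclosed", "lamp_affiliate")


def expected_outputs(row: dict) -> dict:
    """Recompute the derived columns of one edge row from its inputs."""
    n_lamps = sum(bool(row[name]) for name in LAMPS)
    return {
        "lamp_deep_peer": (row["peer_depth"] or 0) >= 4,
        "n_lamps": n_lamps,
        "confirmed": n_lamps >= 1,
        "dark": n_lamps == 0,
        "corr_delta": row["expected_fwd_corr"] - row["trailing_corr"],
    }


def mismatches(row: dict, tol: float = 1e-9) -> list:
    """List the derived columns whose shipped value differs from the recomputed one."""
    bad = []
    for col, want in expected_outputs(row).items():
        got = row[col]
        if isinstance(want, float):
            same = math.isclose(got, want, abs_tol=tol)
        else:
            same = got == want
        if not same:
            bad.append(col)
    return bad
```

An empty list from `mismatches` means the row is internally consistent under these assumptions; it says nothing about whether the lamps themselves are correct.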

4. Table B — `comovement_peers` (the per-stock peer set)

Derived from Table A: each name's top-10 structural peers, the raw material for a hedge basket or a comparables set. One row per (trade_date, ts_code, peer_rank).

Column	Type	Meaning
`trade_date`	date	as-of date of the snapshot (month-end in the research vintages)
`ts_code`	string	the focal A-share name
`peer_rank`	int	1 = closest peer (rank by relatedness, then co-mention strength)
`peer_ts`	string	the peer name
`peer_score`	int	relatedness score = `2·deep_peer + 2·disclosed + n_lamps`
`tier`	string	the pair's grading tier
`trailing_corr`	double	the pair's trailing correlation
`expected_fwd_corr`	double	the pair's forward-correlation forecast
`confirmed`	bool	structurally confirmed (always true for ranked peers)
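
As a rough sketch of how Table B follows from Table A, the function below scores each confirmed edge with the `peer_score` formula, mirrors it to both endpoints, and keeps each name's top 10. It is not the reference implementation: the function names are hypothetical, edges are plain dicts keyed by column name, and the final tie-break on peer code is an assumption added only to make the ordering deterministic.

```python
from collections import defaultdict


def peer_score(edge: dict) -> int:
    """Relatedness score = 2*deep_peer + 2*disclosed + n_lamps."""
    return (2 * int(edge["lamp_deep_peer"])
            + 2 * int(edge["lamp_disclosed"])
            + int(edge["n_lamps"]))


def top_peers(edges: list, k: int = 10) -> list:
    """Build per-stock peer rows from one vintage of edge rows."""
    candidates = defaultdict(list)
    for edge in edges:
        if not edge["confirmed"]:
            continue  # dark pairs never appear in the peer set
        score = peer_score(edge)
        # An undirected edge contributes a peer to each of its two endpoints.
        for focal, peer in ((edge["ts_a"], edge["ts_b"]),
                            (edge["ts_b"], edge["ts_a"])):
            candidates[focal].append((score, edge["comention_strength"], peer))
    rows = []
    for focal, cands in candidates.items():
        # Rank by relatedness, then co-mention strength, then peer code (assumed).
        cands.sort(key=lambda c: (-c[0], -c[1], c[2]))
        for rank, (score, _, peer) in enumerate(cands[:k], start=1):
            rows.append({"ts_code": focal, "peer_rank": rank,
                         "peer_ts": peer, "peer_score": score})
    return rows
```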

5. Point-in-time rule

Every layer is constructed as-of trade_date with no look-ahead:

Substrate — co-mention is measured over the trailing 90 days ending trade_date; trailing_corr uses returns up to and including trade_date.
Lamps — product versions and ownership enter as-of trade_date; the disclosed lamp uses only financial-statement relationships with an availability date in (trade_date − 2yr, trade_date], rebuilt point-in-time (not accumulated), so a relationship that has gone quiet drops out. Tender/bid awards are excluded — they do not forecast co-movement out-of-sample (whitepaper §8).
expected_fwd_corr uses coefficients fitted on the develop window only (2017–mid-2022), so a holdout pair's forecast never sees holdout data.
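
The disclosed-lamp window is half-open: a relationship counts only if its availability date falls in (trade_date − 2yr, trade_date]. The sketch below illustrates that test with hypothetical function names; how "two years earlier" is resolved for a 29 February trade date is an assumption here (it falls back to 28 February), since the source does not state the calendar convention.

```python
from datetime import date
from typing import Iterable


def two_years_before(d: date) -> date:
    """Same calendar day two years earlier (assumed convention)."""
    try:
        return d.replace(year=d.year - 2)
    except ValueError:
        # 29 February has no counterpart two years earlier; use 28 February.
        return d.replace(year=d.year - 2, day=28)


def disclosed_lit(availability_dates: Iterable[date], trade_date: date) -> bool:
    """True if any disclosure became available in (trade_date - 2yr, trade_date]."""
    start = two_years_before(trade_date)
    return any(start < a <= trade_date for a in availability_dates)
```

Because the test is re-evaluated for every trade_date rather than accumulated, a relationship whose latest disclosure is older than two years stops lighting the lamp, and a disclosure dated after trade_date can never light it.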

No forward field is ever shipped: forward correlation exists only inside the replication harness, where it is computed at analysis time to validate the feed.

6. Construction notes a buyer should know

Universe character. The feed concentrates on well-covered names — small caps are not co-mentioned heavily enough to enter the graph with signal. This is why the hedging edge is large-cap (whitepaper §8): the architecture and the scope boundary come from the same fact.
confirmed/dark is the hero field. It exists only because the news substrate gives an observed co-movement to certify. Confirmed correlations retain ~92% of their level a quarter forward vs ~75% for price-screened pairs.
expected_fwd_corr is a structural forecast, not a blacklist penalty. For a dark pair it falls back to the no-structure baseline; the blacklist lives in the dark flag and the retention evidence, not in a negative forecast.
Risk, not alpha. The graph forecasts how names move together, not which way. A return-spillover signal on the same graph is flat-to-negative out-of-sample (whitepaper §8).
Not a global risk model. Adding the graph to a whole-universe minimum-variance optimizer does not reduce realized variance; the value is targeted (single-name hedging, pairs, concentrated-position risk).
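
To make the "structural forecast, not a blacklist penalty" point concrete, here is a minimal sketch of the linear forecast described in Table A. The layout of `model_coefficients.json` is not documented here, so the coefficient dict below (keys `const`, `trailing_corr`, `dark`, plus one key per lamp column) is a hypothetical shape, and the function name is illustrative.

```python
def expected_fwd_corr(row: dict, coef: dict) -> float:
    """Linear forecast: const + trailing term + dark term + sum of lamp terms.

    `coef` is an assumed mapping, not the shipped JSON schema. Any key that
    starts with "lamp_" is treated as the coefficient of that lamp column.
    """
    value = coef["const"] + coef["trailing_corr"] * row["trailing_corr"]
    value += coef["dark"] * float(row["dark"])
    for name, c in coef.items():
        if name.startswith("lamp_"):
            value += c * float(row[name])
    return value
```

For a dark pair every lamp term is zero, so the forecast reduces to the constant, the trailing term, and the dark coefficient, which is the no-structure baseline described above. Subtracting `trailing_corr` from the result gives `corr_delta`.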

7. Quickstart

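import os

import pandas as pd
import pyarrow.dataset as pads

DATA = os.environ["COMOVEMENT_DATA_DIR"]   # root of the store (see §2)

edges = pads.dataset(f"{DATA}/comovement_edges", format="parquet",
                     partitioning="hive").to_table().to_pandas()
# a parquet date column loads as Python date objects by default; make it comparable to a Timestamp
edges["trade_date"] = pd.to_datetime(edges["trade_date"])

# the graph usable at time t (latest vintage on/before t):
t = pd.Timestamp("2026-04-30")
vintage = edges.loc[edges.trade_date <= t, "trade_date"].max()
g = edges[edges.trade_date == vintage]

confirmed = g[g.confirmed]          # trust these correlations
blacklist = g[g.dark]               # discount these: co-moving for no structural reason

# read the peer set for the resolved vintage, since t itself need not be a vintage date
peers = pd.read_parquet(f"{DATA}/comovement_peers/T={vintage:%Y-%m-%d}.parquet")
hedge_basket = peers[peers.ts_code == "600519.SH"].peer_ts.tolist()   # top-10 graph hedge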

To reproduce the whitepaper, run scripts/verify_outputs.py (reference codebase, Numinor-Systems/comovement-codebase) → "✓ matches whitepaper".

8. Versioning

Component	Version
Feed schema	1.0
Methodology	Whitepaper v1.3
Forecast model	develop-fit, shipped as `model_coefficients.json`
Codebase / reference impl	`comovement-codebase` ≥ 1.0.0 (MIT)

Schema changes bump the schema version and are announced ahead of effect.

Numinor Systems Limited · support@numinor.io