Numinor · Beta

Numinor C2C Supply-Chain Construct Data — Data Dictionary v1.0

SKU c2c-construct-v1 · Methodology Numinor C2C Supply-Chain Whitepaper v3.0 · Canonical repo Numinor-Systems/c2c-codebase (MIT) · June 2026


1. What this is

A daily-refreshed, point-in-time edge table of observed company-to-company supply-chain relationships between Chinese A-share listed companies. Two observation channels, one table:

  • disclosed — mandatory periodic-report disclosures: top-5 customers / top-5 suppliers and related-party transactions (reporting periods from 2015);
  • bid — awarded procurement contracts (winning-bid announcements, from 2020).

Both channels are resolved to listed-company identity on both sides through ChinaScope's structured affiliate-ownership tables — subsidiaries and operating entities roll up to their listco parents with real ownership ratios. No language model is involved anywhere in the construction; the rollup is deterministic joins over structured identifier tables.

The product is the graph, not a signal. The customer-momentum signal validated in the whitepaper (union orthogonal ICIR +0.470 full / +0.394 OOS, t = 2.8) is one construction over this table; the same edges support supplier-side spillover, concentration and counterparty-risk measures, network centrality, and shock-propagation studies.

2. Delivery and layout

s3://numinor-construct-data/c2c/parquet/year=YYYY/month=MM/day=DD/data.parquet
                                                              data.csv.zip   (same columns)
s3://numinor-construct-data/c2c/_heartbeat.json                              (freshness)

One live store, all tiers — Sandbox/Realm computes in-kernel on exactly the data the API delivers; freshness is identical everywhere. Refresh is daily by 16:00 Asia/Shanghai, seven days a week.

Partitions are delivery batches, not the PIT clock. A partition holds the edges emitted that day (live: the day the records arrived from the source feed; historical backfill: the day the edge became public). Always filter on the eff_date column — never on partition dates. The stream is append-only: corrections arrive as new rows with the same edge_id and a CDC operation flag; the latest row per edge_id is the current state.
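
The latest-row-wins rule can be illustrated on a toy frame. This is a minimal sketch, not the reference implementation: the rows, ids, and values are invented, the helper name current_state is hypothetical, and ordering by ingestion_date follows the Quickstart in §6 (it assumes at most one row per edge_id per ingestion_date).

```python
import pandas as pd

# Toy append-only stream: invented rows, only 4 of the 25 columns.
stream = pd.DataFrame({
    "edge_id": ["d:r1:000001.SZ:600000.SH", "d:r1:000001.SZ:600000.SH",
                "b:b7:300001.SZ:600519.SH", "b:b7:300001.SZ:600519.SH"],
    "operation": ["A", "U", "A", "D"],
    "relation_value_cny": [100.0, 120.0, 50.0, 50.0],
    "ingestion_date": pd.to_datetime(
        ["2026-01-05", "2026-02-10", "2026-01-06", "2026-03-01"]),
})

def current_state(rows: pd.DataFrame) -> pd.DataFrame:
    """Keep the latest row per edge_id, then drop edges whose latest row is a delete."""
    latest = (rows.sort_values("ingestion_date", kind="stable")
                  .drop_duplicates("edge_id", keep="last"))
    return latest[latest["operation"] != "D"].reset_index(drop=True)

state = current_state(stream)
print(state[["edge_id", "operation", "relation_value_cny"]])
```

In this toy stream the disclosed edge survives with its updated value, and the bid edge disappears because its latest row carries operation D.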

3. Schema (25 columns, in order)

#   Column               Type    Meaning
1   edge_id              string  Stable identity: d:<record>:<sup>:<cus> (disclosed) / b:<bid>:<sup>:<cus> (bid). Latest row per edge_id wins.
2   source               enum    disclosed | bid
3   relation_type        enum    trade (disclosed) | procurement_award (bid)
4   supplier_ts          string  Seller (focal) listco, format NNNNNN.SH/.SZ/.BJ
5   customer_ts          string  Buyer listco
6   relation_value_cny   double  Ownership-adjusted economic value: raw value × both ownership ratios
7   raw_value_cny        double  Pre-adjustment source value (bid awards occasionally lack a parsed price)
8   balance_value_cny    double  Ownership-adjusted ending balance (disclosed only)
9   supplier_own_ratio   double  Listco ownership of the seller party (1.0 = the listco itself)
10  customer_own_ratio   double  Same, buyer side
11  supplier_resolution  enum    direct | affiliate_rollup
12  customer_resolution  enum    direct | affiliate_rollup
13  raw_supplier_id      string  ChinaScope entity id of the party as disclosed
14  raw_customer_id      string
15  raw_supplier_name    string  Bid channel only (the disclosed source carries ids)
16  raw_customer_name    string
17  source_record_id     string  The underlying disclosure record / bid id; audit trail to the raw feed
18  source_rpt_date      date    Reporting period (disclosed) / award date (bid)
19  source_publish_date  date    When the edge became public (see §4)
20  source_basis         enum    filing_publish | rpt_proxy | award_announcement
21  eff_date             date    The PIT contract: source_publish_date + pit_buffer_days
22  pit_buffer_days      int32   The buffer baked into column 21 (product default 30)
23  operation            enum    CDC flag as delivered (A/U/D)
24  ingestion_date       date    When the record arrived (live) / its publish-basis day (backfill)
25  data_vintage         string  Ownership-mapping snapshot (YYYYMMDD) used for the rollup
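
Column 6 is described as the raw value multiplied by both ownership ratios. A minimal arithmetic sketch of that definition follows; the numbers are invented and the helper name ownership_adjusted is hypothetical, not part of the product.

```python
import math

def ownership_adjusted(raw_value_cny: float,
                       supplier_own_ratio: float,
                       customer_own_ratio: float) -> float:
    """relation_value_cny as described for column 6: raw value x both ownership ratios."""
    return raw_value_cny * supplier_own_ratio * customer_own_ratio

# Invented example: a 10m CNY sale by a 60%-owned subsidiary (ratio 0.60)
# to a buyer that is the listco itself (ratio 1.0).
value = ownership_adjusted(10_000_000.0, 0.60, 1.0)
print(value)

# A missing raw value (a bid award without a parsed price) propagates as NaN.
missing = ownership_adjusted(float("nan"), 0.60, 1.0)
print(math.isnan(missing))
```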

4. Point-in-time rule

eff_date = source_publish_date + pit_buffer_days        (default 30, dial 0–120)
  • Disclosed: source_publish_date = the earliest filing publication date of either resolved listco for that reporting period (source_basis = 'filing_publish'). Where no filing date resolves, a conservative report-period + 30d proxy applies, marked 'rpt_proxy'.
  • Bid: the award result/announcement date itself ('award_announcement').
  • The 30-day default decomposes as ~4 days of measured availability latency (the p99 delivery latency of the underlying feed, measured across a full filing season) plus ~26 days of buyer-workflow allowance. Because source_publish_date ships in every row, you can apply any buffer yourself: recompute eff_date from source_publish_date; nothing needs rebuilding.
  • The whitepaper's research convention (report-date + 60d disclosed, +0d bid) is pinned in the frozen data package for bit-exact replication; the feed is the forward-looking product convention.
  • Naming note: whitepaper Appendix A.6 sketched these fields as available_date (its schema is explicitly representative); the production columns follow the platform-wide Date 1–4 audit standard (source_rpt_date, source_publish_date, eff_date), enforced row-exact by the validation gate.
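
Applying a different buffer is a one-line recomputation from source_publish_date. The sketch below assumes a pandas frame with the column names from §3; the helper name rebuffer, the range check against the 0–120 dial, and the toy rows are illustrative assumptions rather than product code.

```python
import pandas as pd

def rebuffer(edges: pd.DataFrame, buffer_days: int) -> pd.DataFrame:
    """Return a copy with eff_date recomputed as source_publish_date + buffer_days."""
    if not 0 <= buffer_days <= 120:
        raise ValueError("buffer_days outside the documented 0-120 dial")
    out = edges.copy()
    publish = pd.to_datetime(out["source_publish_date"])
    out["eff_date"] = publish + pd.Timedelta(days=buffer_days)
    out["pit_buffer_days"] = buffer_days
    return out

# Invented rows delivered with the default 30-day buffer.
toy = pd.DataFrame({
    "edge_id": ["d:r1:000001.SZ:600000.SH", "b:b7:300001.SZ:600519.SH"],
    "source_publish_date": ["2026-04-28", "2026-03-15"],
    "eff_date": ["2026-05-28", "2026-04-14"],
    "pit_buffer_days": [30, 30],
})

# Tighten to a 4-day buffer; the delivered frame is left unchanged.
tight = rebuffer(toy, 4)
print(tight[["edge_id", "source_publish_date", "eff_date", "pit_buffer_days"]])
```

After re-buffering, the PIT read in §6 applies unchanged: filter on the recomputed eff_date.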

5. Construction notes a buyer should know

  • Clean vintage discipline: one row per source record per genuine listco parent; the ownership mapping is deduplicated to the latest vintage per (entity, parent), and multi-parent entities legitimately yield one edge per parent, ownership-weighted. The production construction was re-run through the whitepaper's evaluation harness before launch: union orthogonal ICIR +0.465 / +0.381 vs the published +0.470/+0.394 — the finding is construction-robust.
  • Self-edges are excluded (both sides resolving to the same listco).
  • Ownership ratios are carried as delivered; rare source anomalies with ratios slightly above 1.0 exist and are not clipped (provenance-faithful; quantified in the release audit).
  • Negative source amounts (~0.15% of disclosed records; filing reversals/corrections) are excluded: a negative relationship value has no meaning as a graph weight. Quantified in the release audit.
  • Coverage character: disclosed edges cluster at filing seasons (Apr/Aug/Oct); the bid channel is a steadier daily trickle from 2020. ~4,860 unique sellers reach the union construction historically (~half the tradable A-share universe per month-end).
  • Validation gate: every partition passes a schema/enum/identifier/PIT-identity/degeneracy gate before upload; the heartbeat carries the latest validation block.
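
A buyer can re-check several of the properties above on their own copy of the data. The sketch below is a hypothetical buyer-side check, not the vendor's validation gate: the function name check_partition, the choice of checks, and the toy rows are assumptions; the enum values, ticker format, and PIT identity are taken from §3 and §4.

```python
import pandas as pd

SOURCE = {"disclosed", "bid"}
BASIS = {"filing_publish", "rpt_proxy", "award_announcement"}
TS_CODE = r"\d{6}\.(?:SH|SZ|BJ)"  # NNNNNN.SH/.SZ/.BJ

def check_partition(df: pd.DataFrame) -> dict:
    """Return a name -> bool map of buyer-side sanity checks (hypothetical)."""
    publish = pd.to_datetime(df["source_publish_date"])
    eff = pd.to_datetime(df["eff_date"])
    buffer = pd.to_timedelta(df["pit_buffer_days"], unit="D")
    return {
        "enums": bool(df["source"].isin(SOURCE).all()
                      and df["source_basis"].isin(BASIS).all()),
        "identifiers": bool(df["supplier_ts"].str.fullmatch(TS_CODE).all()
                            and df["customer_ts"].str.fullmatch(TS_CODE).all()),
        # eff_date must equal source_publish_date + pit_buffer_days on every row
        "pit_identity": bool((eff == publish + buffer).all()),
        # both sides resolving to the same listco should never appear
        "no_self_edges": bool((df["supplier_ts"] != df["customer_ts"]).all()),
    }

# Invented rows for illustration.
toy = pd.DataFrame({
    "source": ["disclosed", "bid"],
    "source_basis": ["filing_publish", "award_announcement"],
    "supplier_ts": ["000001.SZ", "300001.SZ"],
    "customer_ts": ["600000.SH", "600519.SH"],
    "source_publish_date": ["2026-04-28", "2026-03-15"],
    "pit_buffer_days": [30, 30],
    "eff_date": ["2026-05-28", "2026-04-14"],
})
report = check_partition(toy)
print(report)
```

Ownership ratios are deliberately not range-checked here, since the source notes that ratios slightly above 1.0 are delivered unclipped.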

6. Quickstart

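import pandas as pd
import pyarrow.dataset as pads
from pyarrow import fs

# Bucket/prefix path without the s3:// scheme, as S3FileSystem expects.
ds = pads.dataset("numinor-construct-data/c2c/parquet", format="parquet",
                  partitioning="hive", filesystem=fs.S3FileSystem(region="ap-northeast-2"))
edges = ds.to_table().to_pandas()

# If the date columns load as datetime.date objects (pyarrow's default for date32),
# convert them before comparing with a Timestamp.
for col in ("eff_date", "ingestion_date"):
    edges[col] = pd.to_datetime(edges[col])

# current state of the graph usable at time t (the only correct PIT read):
t = pd.Timestamp("2026-04-30")
live = (edges[edges["eff_date"] <= t]
        .sort_values("ingestion_date", kind="stable")
        # keep the whole latest row; groupby().last() would fill nulls from older rows
        .drop_duplicates("edge_id", keep="last")
        .query("operation != 'D'"))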

To rebuild the whitepaper's momentum signal on top, use the MIT reference implementation (Numinor-Systems/c2c-codebase, notebooks 01–04) — build_disclosed, build_bid_band_median, standardize_union, orth_eval.

Ask Gandalf for help

Stuck on integration? Open Gandalf (your onsite AI assistant) and ask:

  • "How do I read the C2C edge table point-in-time as of a backtest date?"
  • "How do I rebuild the whitepaper's union signal against my own factor book?"
  • "What does source_basis = 'rpt_proxy' mean for my backtest?"

Gandalf has context on this dictionary, the reference code, and the methodology.

7. Versioning

Component                  Version
Edge-table schema          1.0
Methodology                Whitepaper v3.0
Frozen research vintage    c2c-data-package @ 7cb44891 (immutable)
Codebase / reference impl  c2c-codebase ≥ 1.1.0 (MIT)

Schema changes bump the schema version and are announced ahead of effect; the frozen whitepaper vintage is never modified.


Numinor Systems Limited · Gandalf (onsite) · support@numinor.io