Tag: Non-Contentful

  • How Do You Like Your Metadata?

    [Metadata-First: The Future of Data is now available in all three formats on Amazon: Kindle ebook, paperback, and hardcover. Regardless of which you choose, treating metadata differently may change your perspective and approach.]

    Failure is expensive: some estimate that poor data quality costs the U.S. economy $3.1 trillion per year.[1] A metadata-first approach treats context, lineage, and quality as first-class assets so that decisions can rest on clearer definitions, better-tracked lineage, and observable quality, which can in turn reduce avoidable cost and operational and regulatory risk. Depending on your role, you may like your metadata differently:

    C-level Executive: Portfolio-Level Control and Risk Reduction

    • Can improve time-to-decision on key metrics by giving your teams a single, trusted view of definitions, lineage, and quality. When “what the number means” is clear, executives can focus discussions on trade-offs instead of reconciling dashboards.
    • Can help reduce write-offs from bad data and model errors by making policy, ownership, and quality thresholds explicit in metadata. According to Gartner, “Poor data quality costs organizations an average of $12.9 million every year.”[2] A metadata-first program is one of the levers that can help reduce your share of that loss.
    • Can lower total cost of ownership for your data and AI estate by identifying redundant assets and overlapping feeds. A governed catalog with usage metadata makes it easier to decommission low- or no-use datasets, pipelines, and models without surprises.
    • Can make regulatory conversations faster and more predictable by being able to trace critical KPIs and model outputs back to sources in a repeatable way. Clear lineage and control evidence reduce the back-and-forth required for major reviews.
    • Can help allocate capital more effectively across data and AI initiatives by measuring usage, impact, and reliability at the asset level. Metadata gives you a more objective view of which products are truly used and dependable, and which are candidates for sunset.

    Technology Manager: Simpler Estate, Lower Run Costs

    • Helps rationalize platforms and tools using objective usage and lineage data instead of anecdotes. Operational metadata makes it clear which systems carry critical workloads and which can be consolidated or retired.
    • Can cut incident resolution time by giving SRE and data reliability teams full visibility into upstream dependencies. When a dashboard breaks, teams can immediately see which pipelines, tables, or services are involved instead of hunting through logs and tribal knowledge.
    • Can reduce unplanned work from uncontrolled schema and API changes by enforcing contracts and impact analysis via metadata. Producers and consumers see, in advance, what will be affected by a change and can coordinate rollouts rather than firefighting.
    • Can lower environment sprawl and idle resource cost by tagging workloads, datasets, and models with ownership, criticality, and lifecycle state. Clear tags support deliberate cleanup rather than one-off cost-cutting exercises (see the sketch after this list).
    • Can improve planning accuracy for migrations and re-platforming by relying on actual lineage and usage statistics. With a grounded view of dependencies, you can scope moves realistically and avoid last-minute surprises.
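
    As a minimal sketch of the tagging idea above, the following Python fragment shows how ownership, criticality, and lifecycle tags turn cleanup into a simple query instead of a forensic exercise. The AssetTag fields and the 90-day idle threshold are illustrative assumptions, not a prescribed schema:

        from dataclasses import dataclass
        from datetime import date, timedelta

        @dataclass
        class AssetTag:
            """Operational metadata attached to a dataset, pipeline, or model."""
            asset_id: str
            owner: str            # accountable team or individual
            criticality: str      # e.g., "critical", "standard", "experimental"
            lifecycle_state: str  # e.g., "active", "deprecated", "orphaned"
            last_accessed: date

        def decommission_candidates(tags: list[AssetTag], idle_days: int = 90) -> list[AssetTag]:
            """Return non-critical assets that nobody has touched recently."""
            cutoff = date.today() - timedelta(days=idle_days)
            return [t for t in tags
                    if t.criticality != "critical" and t.last_accessed < cutoff]

        # Example: a stale experimental table surfaces for deliberate review.
        tags = [AssetTag("warehouse.tmp_scores", "ml-platform", "experimental",
                         "active", date(2024, 1, 15))]
        for candidate in decommission_candidates(tags):
            print(f"Review for decommissioning: {candidate.asset_id} (owner: {candidate.owner})")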

    Data Developer: Faster Delivery with Less Rework

    • Can shorten feature and pipeline development cycles by starting from discoverable, well-documented assets rather than rebuilding from scratch. A good catalog lets developers build on what exists instead of reverse-engineering legacy code and schemas.
    • Can cut debugging and triage time by using lineage and data quality metadata to quickly identify where a value changed and why. Instead of scanning multiple systems, developers can follow a clear trail from report back to source.
    • Can reduce rework caused by misaligned definitions and undocumented assumptions by encoding contracts, business rules, and owner expectations in metadata. When expectations are explicit, fewer changes bounce back from downstream consumers (a sketch of such a contract follows this list).
    • Can reduce time spent searching for inputs and examples by using catalogs, registries, and documentation tied to real usage. As one Forrester study summarized by the CDP Institute put it, “knowledge workers lose 30% of their time just looking for data”;[3] better metadata directly attacks that waste.
    • Can lower onboarding time for new data engineers and analytics engineers by providing a navigable map of data assets, dependencies, and standards. New team members can become productive using self-service context instead of relying solely on handoffs.
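
    As a minimal sketch of a contract encoded in metadata, the following fragment checks a proposed schema change against declared consumer expectations. The contract structure and field names are illustrative assumptions:

        # A data contract expressed as plain metadata.
        contract = {
            "dataset": "sales.orders",
            "owner": "order-platform",
            "fields": {
                "order_id":   {"type": "string", "required": True},
                "amount_usd": {"type": "decimal", "required": True},
                "channel":    {"type": "string", "required": False},
            },
        }

        def breaking_changes(contract: dict, proposed_fields: dict) -> list[str]:
            """List required fields the proposed schema would drop or retype."""
            problems = []
            for name, spec in contract["fields"].items():
                if spec["required"] and name not in proposed_fields:
                    problems.append(f"required field removed: {name}")
                elif name in proposed_fields and proposed_fields[name]["type"] != spec["type"]:
                    problems.append(f"type changed: {name}")
            return problems

        # A producer proposes dropping amount_usd; the check flags it before
        # the change ever reaches downstream consumers.
        proposed = {"order_id": {"type": "string"}, "channel": {"type": "string"}}
        print(breaking_changes(contract, proposed))  # ['required field removed: amount_usd']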

    Quant: More Time on Signal, Less on Plumbing

    • Can increase research throughput by minimizing time spent hunting for and validating data. With curated, well-described datasets and clear quality history, quants can focus more on signal generation and less on plumbing.
    • Can help reduce model slippage due to untracked data changes by linking every feature to its source lineage, distributions, and quality checks. When inputs change, the impact on live strategies is easier to spot and explain.
    • Can accelerate backtesting cycles with reproducible, versioned data snapshots and explicit metadata about exclusions, corrections, and survivorship bias. Changing an assumption becomes a controlled rerun instead of an ad hoc rebuild (see the snapshot sketch after this list).
    • Can lower operational risk in production models by giving risk and compliance teams traceable explanations for inputs and transformations. Clear documentation and lineage reduce friction in model approval and review.
    • Can make cross-desk collaboration more feasible by standardizing definitions of instruments, counterparties, and events via shared semantic metadata. Fewer semantic disputes mean fewer reconciliation breaks and manual adjustments.
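
    As a minimal sketch of what a versioned snapshot record might carry, the fragment below pins a backtest to an immutable snapshot rather than to "latest" mutable data. The field names (snapshot_id, corrections_applied, exclusions) are illustrative assumptions:

        snapshot = {
            "snapshot_id": "prices-2024-06-30-v3",
            "as_of": "2024-06-30",
            "source": "vendor_eod_prices",
            "corrections_applied": ["split-adjustment-2024-05"],
            "exclusions": ["delisted-before-2010"],  # guards against survivorship bias
        }

        def load_for_backtest(meta: dict) -> None:
            """Resolve an immutable, explicitly versioned snapshot.

            Rerunning with the same snapshot_id reproduces the original inputs;
            changing an assumption means pinning a new snapshot version rather
            than rebuilding data ad hoc.
            """
            print(f"Loading {meta['snapshot_id']} (as of {meta['as_of']})")
            # ...fetch the pinned snapshot from storage here...

        load_for_backtest(snapshot)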

    AI Engineer: Faster Experiments, Safer Deployment

    • Can increase experiment throughput by standardizing how datasets, features, and model versions are described and discovered. With consistent registries and catalogs, engineers spend less time re-discovering assets and more time running useful tests.
    • Can reduce time spent on data and feature wrangling by reusing well-governed features across models. A shared feature store with strong metadata turns common signals into shared building blocks.
    • Can lower the risk of model failures in production by tracking lineage from raw data to model artifacts, including training/serving skew checks. When something fails, teams can quickly see which data, code, and configuration were involved.
    • Can improve compliance with emerging AI regulations by attaching policy metadata (for example, allowed use, retention limits, and consent requirements) directly to datasets and models. Enforcement is then driven by facts in the catalog, not only by process documents (a sketch of such a policy gate follows this list).
    • Can enable safer and faster rollouts via metadata-driven canarying, shadow deployments, and rollback plans tied to specific model versions. Clear metadata about dependencies and blast radius makes it easier to roll out and roll back without unexpected side effects.
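
    As a minimal sketch of policy metadata driving a deployment gate, the fragment below blocks a rollout whose intended use violates the policy attached to the model. The policy keys and the gate logic are illustrative assumptions, not a specific product's API:

        model_record = {
            "model": "churn-predictor",
            "version": "1.4.2",
            "policy": {
                "allowed_uses": ["retention-campaigns"],
                "prohibited_uses": ["credit-decisions"],
                "requires_consent": True,
            },
        }

        def approve_deployment(record: dict, intended_use: str) -> bool:
            """Gate a rollout on the policy facts in the catalog record."""
            policy = record["policy"]
            if intended_use in policy["prohibited_uses"]:
                print(f"Blocked: {record['model']} may not be used for {intended_use}")
                return False
            if intended_use not in policy["allowed_uses"]:
                print(f"Blocked: {intended_use} is not an approved use")
                return False
            return True

        approve_deployment(model_record, "credit-decisions")     # blocked
        approve_deployment(model_record, "retention-campaigns")  # approved (returns True)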

    Conclusion: One Discipline, Shared Gains

    A metadata-first approach aims to create a shared, accurate map of your data and model landscape. Executives can gain clearer risk and ROI signals, managers can simplify estates and lower run costs, and practitioners can spend more time on analysis and model design instead of plumbing and forensics.

    The practical next step is straightforward: assess current metadata coverage and quality, identify a small number of high-impact use cases (for example, a critical regulatory report, a flagship trading or pricing model, or a core customer KPI), and run a focused pilot. Use concrete observations—changes in cycle time, incident patterns, and run cost—to measure impact, then scale the patterns that demonstrably improve both speed and safety.

    References

    [1] Thomas C. Redman, “Bad Data Costs the U.S. $3 Trillion Per Year,” Harvard Business Review, September 22, 2016. Available at: https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year

    [2] Gartner, “Data Quality: Why It Matters and How to Achieve It,” Gartner Research (Data and Analytics). Quoted line: “Poor data quality costs organizations an average of $12.9 million every year.” Available at: https://www.gartner.com/en/data-analytics/topics/data-quality

    [3] CDP Institute, “Knowledge Workers Lose 30% of Time Looking for Data: Forrester Study,” summarizing a Forrester study finding that “knowledge workers lose 30% of their time just looking for data.” Available at: https://www.cdpinstitute.org/news/knowledge-workers-lose-30-of-time-looking-for-data-forrester-study/

  • Contentful vs. Non-Contentful Code

    Contentful Code

    “Contentful” code understands the business, while “non-contentful” code doesn’t know anything about the business itself (although it does know where to look).

    Today’s contentful code works but has some issues:

    • Training is expensive and time-consuming: Even developers who arrive with domain knowledge need several months to learn the business well enough to be productive.
    • Turnaround time, risk, and costs for new development and changes are high: Because business rules and data quality checks are spread across several codebases without a cohesive, consistent structure, impact analysis takes time, testing is laborious, and code changes often introduce second-order bugs.

    Shouldn’t code know about the business? Isn’t that its purpose? How could it not?

    Business systems need to know about the business, but data systems do not: The purpose of data code is to perform data operations with low risk and cost while maintaining good governance, quality, architecture, integrity, trust, traceability, security, and reportability.

    Many IT and data professionals (leaders, managers, stewards, engineers, etc.) make a hidden – and incorrect – assumption: that code which moves and transforms data must embed the details of how a business operates. They assume it must include the names, attributes, and behaviors of the business’s source systems, people, processes, databases, tables, and columns, as well as the specific business rules, data quality rules, and transformation rules to apply.

    Can you imagine how such code might not contain this information?

    As John Lennon famously sang in his song “Imagine,” “It isn’t hard to do.”

    One way is to examine contentful code and identify its business content. For example, code that extracts and loads data from Salesforce might refer to an Opportunity object. That’s contentful: The code refers to a specific entity called Opportunity in the Salesforce source system.
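
    As a minimal sketch of such contentful code (here using the simple_salesforce client; the credentials are placeholders and the field list is illustrative):

        # Contentful: business knowledge -- the source system (Salesforce), the
        # object (Opportunity), and its fields -- is baked directly into the code.
        from simple_salesforce import Salesforce

        sf = Salesforce(username="...", password="...", security_token="...")

        results = sf.query("SELECT Id, Name, Amount, StageName FROM Opportunity")
        for record in results["records"]:
            print(record["Id"], record["Amount"])  # or load into the warehouse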

    Non-Contentful Code

    Now imagine instead fetching Salesforce table names from a data catalog, looping through them, and then looping through each table’s attributes. You’ve made your code significantly less contentful: It no longer needs to know or refer to table or column names. It still refers to the Salesforce source system, but you could rewrite the code further to loop through a list of source systems and thereby remove the source system name from the code as well, as in the sketch below.
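
    As a minimal sketch of the non-contentful version, the fragment below stubs a tiny in-memory catalog so it runs; in practice the lookups and the fetch and load helpers would call your catalog, connector, and warehouse APIs. Note that no table, column, or source system name appears in the loop itself:

        # A stand-in for a real metadata catalog: sources -> tables -> columns.
        catalog = {
            "salesforce": {
                "Opportunity": ["Id", "Name", "Amount", "StageName"],
                "Account": ["Id", "Name", "Industry"],
            },
        }

        def fetch(source: str, table: str, columns: list[str]) -> list[dict]:
            """Hypothetical connector call; returns rows for one table."""
            return []  # stubbed for the sketch

        def load_into_warehouse(source: str, table: str, record: dict) -> None:
            """Hypothetical loader; writes one record to the warehouse."""
            ...

        # Non-contentful: everything below is discovered from metadata.
        for source, tables in catalog.items():
            for table, columns in tables.items():
                for record in fetch(source, table, columns):
                    load_into_warehouse(source, table, record)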

    This blog post introduces these new terms: “contentful” and “non-contentful.” When socialized, they form a valuable shorthand for those working in Metadata-First environments.