WBG
From open data to AI-ready data: Building the foundations for responsible AI in development

The production and use of development data have undergone significant transformation over the past two decades. The shift from paper-based records to digital formats has made data more accessible and easier to share. The open data movement has dramatically increased the availability of government and institutional datasets, which in turn has created greater opportunities for analysis, transparency, and innovation. And major advances in big data and data science have further expanded both the volume and diversity of information guiding development policy.

Amid rapid advances in artificial intelligence (AI), development data has now reached another pivotal juncture: the evolution to AI-ready development data—data that is readily discoverable, comprehensible, accessible, and usable by both humans and AI applications.

Why AI-Ready Data?

AI, particularly large language models (LLMs), is transforming the way people interact with data. Data users at all levels of experience and expertise, from first-timers to power users, can now pose complex questions in natural language to chatbots, which they expect to promptly find, interpret, and present data-driven insights as concise, accurate responses.

For this evolution to be successful, AI systems need to get it right. This means the data being accessed and interpreted by AI systems must first be evaluated, validated, structured, governed, and shared in ways that support the responsible and effective use of AI. In short, the data must be “AI-ready.” 

AI-ready data does not supplant earlier advancements, foundational concepts, or standards, such as the Fundamental Principles of Official Statistics, open data frameworks, or the FAIR (Findable, Accessible, Interoperable, and Reusable) principles; rather, it builds on them. By extending established foundations and standards, AI-ready data keeps development data open, discoverable, and reusable, while ensuring that it is systematically organized and well documented for seamless use by both people and AI systems. Ensuring AI-readiness can thus shorten the distance between development data and decision-making, supporting better policies and faster innovation and democratizing development insights. The World Bank, in its efforts to become a bigger, better “Data Bank,” is already working to make this happen, together with country partners and the global development community.

The Case for AI-Ready Data

Generative AI has emerged as a key interface for individuals seeking information, including on development-related topics. Platforms such as Google’s AI Overviews, Microsoft’s Bing, Perplexity.AI, and OpenAI’s ChatGPT comb through the internet and combine different sources of information to generate responses to user queries. The challenge, of course, is that AI responses are only as authoritative as the data that feed them. And the reality is that these systems frequently draw upon general internet content (including unproven sources) or web search results, rather than prioritizing authoritative data sources like the World Bank or national statistical offices.

Since current AI systems often select suboptimal development data sources, users regularly encounter outdated or incorrect responses, even when accurate information is otherwise available. This is problematic, since most AI responses have the appearance of providing authoritative information, even as they hallucinate.

It is important to emphasize that high-quality, authoritative development data is not scarce. In other words, AI tools do not need to rely on suboptimal data sources to form responses to queries about development topics. What is missing is a standardized framework and robust infrastructure to enable AI tools to consistently find, access, and use reliable development data from trusted sources to deliver accurate answers to user questions.

AI-ready development data can help overcome this information integrity problem. Governments, international organizations, and the private sector can enable seamless AI access to and use of trusted development data by adopting interoperability protocols and standards. Doing so will help support evidence-based decision-making, enhance public access to reliable information, and promote trust in authoritative sources of development data and statistics.

What Makes Data “AI-Ready”?

AI-ready development data is systematically organized and thoroughly documented to ensure its meaning and context are clear not only for subject matter experts, but also for general users and AI systems.

Three core pillars define AI-ready development data:

  1. AI-Ready Data Systems: The foundational infrastructure—encompassing discovery platforms, APIs, and technical standards—ensures that data is not only stored but also readily discoverable, interoperable, and accessible.
     
  2. High-Quality Data and Metadata: Reliable, up-to-date, and thoroughly documented data, accompanied by comprehensive and structured metadata. For AI applications, this entails datasets that are systematically organized and described with sufficient specificity to ensure both machine and human analysts can accurately interpret the information.
     
  3. Robust Governance and Strategic Partnerships: The implementation of comprehensive policies, standardized procedures, and collaborative efforts across sectors is essential to ensuring data integrity, enhancing transparency, and advancing responsible utilization. These measures are fundamental to cultivating public trust among both human and AI stakeholders.

By leveraging these foundational elements, development data becomes an accessible asset to all stakeholders. AI-ready data is positioned to enhance public access, enable advanced insights through AI, and facilitate more rapid and informed decision-making throughout society.

Making AI-Ready Data a Reality

To operationalize these foundational pillars, we must translate principles into actionable steps. Development data encompasses several forms, including indicators, microdata, and geographic datasets. While the following recommendations can be adapted to different types of data, they are especially tailored for indicators.

1. AI-Ready Data Systems

  • Data Discovery: Incorporate both semantic and lexical search capabilities to enable users and AI systems to identify relevant data based on meaning as well as keywords. Provide support for multilingual search and ensure that results are accessible in machine-readable formats via APIs.
  • Data Accessibility: Implement open, machine-actionable standards such as SDMX, accompanied by comprehensive API documentation and robust metadata, enabling AI systems to efficiently interpret and integrate data. Ensure that data is made available under permissive open data licenses.
  • AI Interoperability: Employ open standards, such as the Model Context Protocol (MCP), to enable AI systems to efficiently identify and interpret reliable data sources. Ensure transparency and maintain oversight regarding data context and utilization.
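
The Data Discovery recommendation above, combining lexical and semantic search and returning machine-readable results, can be illustrated with a minimal sketch. It is not the World Bank's implementation: the catalog entries, the identifiers, and the hand-written three-dimensional vectors are all made up for illustration, and the vectors stand in for the output of a real embedding model. The weighting parameter `alpha` is likewise an assumption.

```python
import json
import math
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_score(query, text):
    """Keyword match: Jaccard overlap between query and title tokens."""
    q, d = tokenize(query), tokenize(text)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def cosine(u, v):
    """Semantic match: cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_search(query, query_vec, catalog, alpha=0.5):
    """Rank entries by alpha * semantic score + (1 - alpha) * lexical score."""
    results = []
    for entry in catalog:
        score = (alpha * cosine(query_vec, entry["vector"])
                 + (1 - alpha) * lexical_score(query, entry["title"]))
        results.append({"id": entry["id"], "title": entry["title"],
                        "score": round(score, 3)})
    return sorted(results, key=lambda r: r["score"], reverse=True)


# Toy catalog: ids, titles, and vectors are hypothetical placeholders.
CATALOG = [
    {"id": "IND_A", "title": "Poverty headcount ratio", "vector": [0.9, 0.1, 0.0]},
    {"id": "IND_B", "title": "Primary school enrollment", "vector": [0.1, 0.9, 0.1]},
    {"id": "IND_C", "title": "Access to electricity", "vector": [0.0, 0.2, 0.9]},
]

# The query shares no keywords with any title, so only the semantic
# component can surface the poverty indicator.
ranked = hybrid_search("share of people living on low incomes",
                       [0.8, 0.2, 0.1], CATALOG)

# Machine-readable output that an API could return to a person or an AI system.
print(json.dumps(ranked, indent=2))
```

In this toy case the query has no keyword in common with any title, which is the situation where a purely lexical search returns nothing useful and the semantic component does the work. Queries that name an indicator exactly benefit from the lexical component instead.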

The World Bank’s Development Data Group and Office of the Chief Statistician are actively investing in these domains, including piloting advanced search tools, developing embedding models for low-resource contexts, integrating APIs, and developing an MCP server to support the new Data360 platform and other selected datasets.
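
To make the AI Interoperability recommendation more concrete, the sketch below shows what a data tool exposed to AI systems might look like. It uses plain Python dictionaries in the general shape of an MCP tool definition (a name, a description, and a JSON Schema for inputs) and of a tool result (a list of content items plus an error flag). It does not use an MCP SDK and is not the World Bank's server. The tool name, the in-memory store, and every value in it are hypothetical.

```python
import json

# Hypothetical tool descriptor: name, description, and a JSON Schema
# describing the arguments an AI client must supply.
GET_INDICATOR_TOOL = {
    "name": "get_indicator",
    "description": "Return observations for one indicator and country, "
                   "together with source, license, and last update date.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "indicator_id": {"type": "string"},
            "country": {"type": "string",
                        "description": "ISO 3166-1 alpha-3 code"},
        },
        "required": ["indicator_id", "country"],
    },
}

# Toy in-memory store standing in for a real data platform.
# All identifiers and values are made up for illustration.
STORE = {
    ("EXAMPLE_IND", "XXX"): {
        "observations": [{"year": 2020, "value": 12.3},
                         {"year": 2021, "value": 11.8}],
        "source": "Example national statistical office",
        "license": "CC BY 4.0",
        "last_updated": "2024-01-01",
    }
}


def text_result(text, is_error=False):
    """Wrap a string in a tool-result structure with an error flag."""
    return {"isError": is_error, "content": [{"type": "text", "text": text}]}


def handle_get_indicator(arguments):
    """Validate the arguments, look up the data, and return it with provenance."""
    required = GET_INDICATOR_TOOL["inputSchema"]["required"]
    missing = [key for key in required if key not in arguments]
    if missing:
        return text_result("Missing arguments: " + ", ".join(missing), True)
    record = STORE.get((arguments["indicator_id"], arguments["country"]))
    if record is None:
        return text_result("No data found for this indicator and country", True)
    return text_result(json.dumps(record))


result = handle_get_indicator({"indicator_id": "EXAMPLE_IND", "country": "XXX"})
print(result["content"][0]["text"])
```

Returning the source, license, and update date alongside the values is a design choice in this sketch: it gives the AI system what it needs to attribute the answer, which supports the transparency and oversight goals described above.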

2. High-Quality Data and Metadata

  • Data Quality Assurance: Conduct comprehensive validation of data throughout all stages, employing automated verification processes alongside anomaly detection methodologies. Ensure staff receive thorough training in data quality management, as robust data assurance is critical for both human and AI-based analyses.
  • Multiple Dissemination Formats: Provide data in a range of open formats, including CSV, Parquet, Arrow, JSON, and APIs, to accommodate diverse user requirements and facilitate seamless integration into AI workflows.
  • Use of Metadata Standards: Apply international metadata standards and keep all dataset metadata current and detailed.
  • Establish Robust Metadata Standards: Formulate and implement comprehensive, field-specific guidelines for generating structured metadata, utilizing AI-driven tools to perform automated quality assurance and enhancement processes.
  • Management Tools: Invest in advanced platforms and technologies that enable data and metadata validation, as well as robust data and metadata management at scale, by leveraging artificial intelligence solutions.
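
The Data Quality Assurance recommendation above pairs automated verification with anomaly detection. The sketch below shows one simple way to do both for a single indicator series. It assumes a percentage indicator bounded between 0 and 100 and uses a modified z-score (median and median absolute deviation) with the commonly used cutoff of 3.5. The series is invented for illustration, and checking raw values only suits short, slowly changing series; a production check would likely account for trend first.

```python
import statistics


def validate_series(observations, lower=0.0, upper=100.0):
    """Rule-based checks: duplicate years, missing values, out-of-range values."""
    issues = []
    seen_years = set()
    for year, value in observations:
        if year in seen_years:
            issues.append((year, "duplicate year"))
        seen_years.add(year)
        if value is None:
            issues.append((year, "missing value"))
        elif not lower <= value <= upper:
            issues.append((year, "out of range"))
    return issues


def flag_anomalies(observations, threshold=3.5):
    """Return years whose value is an outlier by the modified z-score."""
    clean = [(year, value) for year, value in observations if value is not None]
    if len(clean) < 3:
        return []  # too few points to estimate spread
    values = [value for _, value in clean]
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread, so the score is undefined
    return [year for year, value in clean
            if 0.6745 * abs(value - median) / mad > threshold]


# Invented series for a percentage indicator: 2018 looks like transposed
# digits, 2021 is missing, and 2022 exceeds the valid range.
SERIES = [
    (2013, 69.0), (2014, 70.1), (2015, 71.0), (2016, 72.1), (2017, 73.0),
    (2018, 37.0), (2019, 75.2), (2020, 76.0), (2021, None), (2022, 101.5),
]

print(validate_series(SERIES))
print(flag_anomalies(SERIES))
```

The two checks are complementary. The range rule catches the impossible 2022 value but not the 2018 dip, which is within range yet inconsistent with its neighbors. The outlier check catches both. Flagged values would then go to a reviewer rather than being corrected automatically.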

The World Bank, through its Data Quality and AI for Data / Data for AI work programs, advances these initiatives by providing open-source resources, including the Metadata Editor, comprehensive guidelines for creating high-quality metadata, and pilot frameworks that leverage AI to efficiently assess and enhance the quality of metadata.
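
A simple rule-based baseline for assessing metadata quality, which could run before any AI-assisted review, is a completeness check. The sketch below is an assumption-laden illustration and not the method used by the Metadata Editor or the World Bank's pilot frameworks. The list of required fields, the minimum definition length, and the sample record are all hypothetical and would in practice come from the metadata standard in use.

```python
# Hypothetical list of fields a structured indicator record should carry.
REQUIRED_FIELDS = ["name", "definition", "unit", "source",
                   "methodology", "license", "last_updated"]


def metadata_completeness(record, required=REQUIRED_FIELDS,
                          min_definition_words=10):
    """Score a metadata record by the share of required fields that are filled."""
    # A field counts as missing when it is absent, None, or blank.
    missing = [field for field in required
               if not str(record.get(field) or "").strip()]
    warnings = []
    definition = str(record.get("definition") or "").strip()
    if definition and len(definition.split()) < min_definition_words:
        warnings.append("definition is too short to be self-explanatory")
    score = (len(required) - len(missing)) / len(required)
    return {"score": round(score, 2), "missing": missing, "warnings": warnings}


# Illustrative record with an empty source, no methodology, and a thin definition.
record = {
    "name": "Access to electricity (% of population)",
    "definition": "Share of population with access.",
    "unit": "%",
    "source": "",
    "license": "CC BY 4.0",
    "last_updated": "2024-01-01",
}

report = metadata_completeness(record)
print(report)
```

A check like this only measures whether fields are present. Judging whether a definition is accurate or a methodology note is adequate still needs a subject matter expert or a more capable assessment tool.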

3. Governance and Partnerships

  • Policy Compliance and Accountability: Implement robust policies that promote rigorous standards for data and metadata quality, transparency, and open access. Regularly monitor third-party usage and establish effective feedback mechanisms.
  • Ethics and Privacy: Integrate ethical considerations and privacy safeguards into every stage of data handling, perform comprehensive impact assessments, and ensure transparency regarding analytical methodologies and data sources.
  • International Collaboration: Facilitate the harmonization of standards and terminology through coordinated initiatives, enhance technical assistance processes, and develop comprehensive shared tools and resources.
  • Engagement with the Private Sector: Foster collaborative partnerships with technology firms to promote the development of AI tools grounded in reliable and well-governed data. Initiate joint pilot projects, disseminate established best practices, and advocate for increased transparency across all initiatives. Additionally, support the creation of low-resource AI solutions to ensure accessibility for organizations facing significant resource limitations.

The World Bank is establishing partnerships with countries, the private sector, and international organizations, including the United Nations Statistical Commission, the IMF, the OECD, and the African Development Bank (AfDB), to promote governance and the adoption of global standards and mechanisms for effectively managing and using development data with AI systems.

Why is AI-Readiness for Development Data Unique?

Development data differs from most private sector data in that it must meet the needs of diverse users, including governments, organizations, researchers, civil society, businesses, and the public. Treated as public intent data, it requires openness, transparency, and accountability. Since development data influences policy and investment decisions across countries and systems, interoperability and thorough documentation are essential.

A Call to Action

The transition to AI-ready development data is both urgent and extensive. Realizing this objective will necessitate:

  • Investment in data infrastructure, skills development, and the adoption of global standards related to data systems, metadata, and governance.
  • Cooperation among governments, international organizations, and the private sector to facilitate the exchange of best practices and maintain strategic alignment.
  • Continuous innovation and flexibility, given the evolving nature of AI technologies and user requirements.

We encourage national statistical offices, data producers, policymakers, and technology partners to participate in this initiative. Through collaborative effort and the necessary adoption of global data quality standards, we can ensure that development data continues to serve as a reliable, inclusive, and robust resource for the public good as we progress into the Age of AI.

Let us work collectively to prepare development data for the future and ensure its benefits are accessible to all.