Time series data integration at scale

Introduction

Though most of us rarely, if ever, use the phrase “time series data” in our daily lives, we use the concept all the time. On the most basic level, we do it whenever we make a decision based on past experience: Should I buy this plane ticket now? Is this a good city to live in? Will this be a good market for my product? And although today “data-driven decision making” is prominent to the point of cliché, that too is nothing new: 5,500 years ago the Sumerians used clay tablets to track what today we refer to as economic development (trade deals, prices, harvests, etc.). Now we create about 2.5 billion gigabytes of data daily, the equivalent of 2.5 million fully loaded 1-terabyte iPads.

Why has the world gone so crazy about data?

One reason: More data means better models, which ideally means better decisions. After all, behind even what may seem to be the simplest of decisions, many questions need answering: What’s the current state of the world? How’d we get there? What might happen next? Time series data lets us nail down critical reference points, and the more the better—but that creates a new problem: The mountains of data we generate today can actually lead to complicated, contradictory, or faulty models and forecasts, which can make it harder and riskier to make informed decisions. This is where Knoema comes in.

What We Do – and What We Do Not

Knoema currently provides the world’s most comprehensive integrated database of global data, hosting more than 2.8 billion time series across 24,000 datasets published by more than 1,200 sources.

Yet not all data is created equal, so Knoema takes on the tasks of structuring, validating, transforming, and maintaining it. Once we feel confident in our data, we integrate it into a single platform to make it understandable, accessible, and, above all, usable.

So at the risk of oversimplifying, Knoema specializes in conflict resolution: our proprietary data management system pulls data from more than 1,000 sources, automatically validates it, and presents it in a single, seamless platform where all datasets are matched and uniformly formatted. We can also arrange customized datasets to support specific analytic projects, and we support an array of languages, including some outside the core U.N. languages.

What we do not do is nearly as important – we do not alter the data published by sources. The variables, units of measure, metadata, and more are published on Knoema exactly as the sources released them. This means you can be assured that what we publish mirrors the sources to every extent possible, and we provide a direct citation back to the original source data so you can verify it at any time.

The Challenges of Creating the Knoema Database

Again, the mantra: Not all data is created equal. “Difficult” data—often locked in complex workbooks, scattered across multiple PDFs, or obscured behind arcane single-language systems—requires experience and, ideally, familiarity with the topic or source, which Knoema data engineers bring to the table. And even “friendly” data (accessible from comprehensive, modern APIs) isn’t always usable: it sometimes demands streamlining and standardization before clients can have quick, direct access. Here are a few of the most common challenges we encounter.

Bad APIs: APIs sometimes overpromise usability, which can lead users to wrongly blame themselves. Some APIs are deliberately limited in functionality (to cap usage), and some API archives are incomplete. All too often, APIs don’t standardize metadata and data in a universally accessible way, and others—such as the U.N. Comtrade data we publish weekly—don’t include revision information, which can make it impossible to figure out which historical data may have been changed or removed.
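When an API offers no revision history, one common workaround is to keep periodic snapshots and diff them. Here is a minimal Python sketch of that idea; the keys and values are invented stand-ins, not the real Comtrade schema or Knoema’s actual pipeline:

```python
# Sketch: detect revised, removed, or newly added records between two
# snapshots of a dataset whose API exposes no revision history.
# Keys (reporter, partner, year) and values are illustrative.

def diff_snapshots(old, new):
    """Compare two {key: value} snapshots and report the differences."""
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    added = {k: new[k] for k in new.keys() - old.keys()}
    return {"changed": changed, "removed": removed, "added": added}

last_week = {("DEU", "USA", 2020): 118.0, ("DEU", "FRA", 2020): 87.5}
this_week = {("DEU", "USA", 2020): 119.2, ("DEU", "CHN", 2020): 96.1}

report = diff_snapshots(last_week, this_week)
# One series was revised, one disappeared, one is new.
```

With each weekly pull compared against the previous one, silently revised or withdrawn series surface automatically instead of going unnoticed.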

Troublesome PDFs: By design, PDFs offer a measure of control over presentation and create an official historical record. But PDFs can also create difficulties for those who are interested in the data they contain. For instance, the National Oceanic and Atmospheric Administration (NOAA) publishes worldwide weather data (see figure below) in free PDFs every month, but each massive file requires a considerable time commitment from anyone who wants to visualize all that information. NOAA also doesn’t publish station locations in GPS coordinate format, a conversion step we must add to increase functionality.
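Adding GPS coordinates typically means converting whatever positional notation a source uses into decimal degrees. A minimal sketch of such a conversion, assuming a degrees-minutes-hemisphere notation (the input format here is illustrative, not NOAA’s exact layout):

```python
import re

# Sketch: convert a degrees-minutes-hemisphere string such as "40 42N"
# into a signed decimal coordinate. The format is a made-up example of
# the positional notations sources use instead of GPS coordinates.

def dms_to_decimal(text):
    m = re.fullmatch(r"(\d+)\s+(\d+)([NSEW])", text.strip())
    if not m:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    deg, minutes, hemi = int(m.group(1)), int(m.group(2)), m.group(3)
    value = deg + minutes / 60.0
    # South and West hemispheres become negative decimal degrees.
    return -value if hemi in "SW" else value

lat = dms_to_decimal("40 42N")   # 40.7
lon = dms_to_decimal("74 0W")    # -74.0
```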

Another PDF barrier: They might present data in mixed formats (see Zambia’s 2013 open budget below). Though PDF readers can convert information from text to data formats, that technology isn’t yet perfect. In response, Knoema increases the share of data we review manually to validate the data for our users.
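Manual review pairs naturally with automated cleanup of the messy cells that PDF extraction produces. A small sketch of that kind of cell-normalization pass; the conventions handled (footnote markers, parenthesized negatives, “no data” placeholders) are typical of budget PDFs generally, not Zambia’s exact formatting:

```python
def clean_cell(raw):
    """Normalize one extracted table cell to a float or None.

    Handles thousands separators, footnote markers, parenthesized
    negatives, and common 'no data' placeholders. The conventions are
    illustrative of mixed-format PDF tables.
    """
    text = raw.strip()
    if text.lower() in {"", "n/a", "na", "-", "..."}:
        return None  # explicit gap, not a zero
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    text = text.replace(",", "").rstrip("*")
    value = float(text)
    return -value if negative else value

clean_cell("1,234.5*")   # 1234.5
clean_cell("(12.3)")     # -12.3
clean_cell("n/a")        # None
```

Anything the cleaner cannot parse raises an error, which is exactly the kind of cell that gets routed to a human reviewer.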

Language barriers: In some cases, sources publish datasets that contain more than one language. Japan’s monthly publication of ‘Consumption and Production of Chemical Industry by Chemicals’ is a mixed bag of trend symbols, abbreviations, non-standard date codes, and (partially) dual-language content. The complexity of normalizing these tables into a fully bilingual dataset compounds with each monthly update.
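As one illustration of the normalization involved, Japanese statistical tables often date rows by imperial era year rather than the Gregorian calendar. A minimal sketch of converting such codes, assuming date codes of the form 平成30年4月 (Heisei year 30, April); the actual publication’s codes may differ:

```python
import re

# Sketch: normalize a Japanese era-based date code (e.g. "平成30年4月",
# Heisei year 30, April) to an ISO "YYYY-MM" string.
# Gregorian year = era offset + era year (Shōwa 1 = 1926, Heisei 1 = 1989,
# Reiwa 1 = 2019).
ERA_OFFSETS = {"昭和": 1925, "平成": 1988, "令和": 2018}

def normalize_era_date(code):
    m = re.fullmatch(r"(昭和|平成|令和)(\d+)年(\d+)月", code)
    if not m:
        raise ValueError(f"unrecognized date code: {code!r}")
    era, year, month = m.group(1), int(m.group(2)), int(m.group(3))
    return f"{ERA_OFFSETS[era] + year}-{month:02d}"

normalize_era_date("平成30年4月")  # "2018-04"
normalize_era_date("令和2年1月")   # "2020-01"
```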

Maintaining large datasets: Even if data is easy to access, and even if it is well structured, it might not be easy to upload and maintain. One example is the U.N. Comtrade dataset, which includes data on trade flows between all countries for thousands of commodities. The file size makes it impossible to handle in MS Excel and difficult in MS SQL Server, so we turned to Python scripts to process the information and store it in a scalable online data warehouse.
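Streaming the file row by row is the usual way around spreadsheet limits: only the running aggregates live in memory, no matter how large the file grows. A simplified sketch, assuming a CSV with illustrative column names rather than Comtrade’s real schema:

```python
import csv
from collections import defaultdict

# Sketch: stream a trade-flow CSV too large for Excel, summing trade
# value per (reporter, partner) pair in a single pass. Column names
# are illustrative, not the real Comtrade schema.

def aggregate_trade(path):
    totals = defaultdict(float)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = (row["reporter"], row["partner"])
            totals[key] += float(row["trade_value"])
    return dict(totals)
```

In practice the aggregates (or the cleaned rows themselves) are then bulk-loaded into the warehouse rather than kept in local files.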

Technology and Teamwork

In the course of this demanding and detailed work, we’ve learned that each format presents its own nuanced conflicts, so we constantly build data extraction tools tailored to individual sources. Knoema’s tools automatically refresh available data and synchronize the sources with our repository. (Client tools and packages are available in C#, Java, Python, and R, and they can be found on GitHub.)

Excel and PDF often make machine-reading difficult. We’ve built a GUI-based data management tool (DMT) that visually structures Excel files and imports them for instant analysis, visualization, and sharing. Our DMT can also extract from machine-readable PDF files.

The Knoema repository is shareable on web and mobile, as well as via our own API, which allows third-party applications to access and adapt our data for modeling, analysis, and visualization. Our platform also supports open-data formats such as OData and SDMX. Please visit our Developer section for more details.
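For a flavor of what an OData-style request looks like, here is a sketch that builds a query URL from the standard OData system query options. Only `$filter`, `$select`, and `$top` are part of the OData convention; the service root and entity name are made up for illustration:

```python
from urllib.parse import urlencode

# Sketch: assemble an OData query string. "$filter", "$select", and
# "$top" are standard OData system query options; the endpoint and
# entity are hypothetical. Note that urlencode percent-encodes the
# "$" prefix as "%24", which OData services accept.

def odata_query(service_root, entity, **options):
    params = {f"${k}": v for k, v in options.items()}
    return f"{service_root}/{entity}?{urlencode(params)}"

url = odata_query(
    "https://example.com/odata", "Indicators",
    filter="Country eq 'DE'", select="Year,Value", top=10,
)
```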

Over the last five years Knoema has on average added 500 million new time series and 7,000 new datasets per year, and we update about 7,500 datasets monthly.

Major trends in world trade development over the last decades

Our Team

Our engineering team boasts both technical skills and extensive domain expertise. All ingested data passes through this team for a series of quality checks and reviews before publication.

Knoema data engineers use automated tools and manual processes alike to conduct their reviews. We also prize recency: We tag every ingested dataset with an expected next update, so the workflow kicks in again as soon as that update becomes available.
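The update-tagging idea can be sketched as a simple scheduling check; the field names and frequency table below are illustrative, not our production schema:

```python
from datetime import date, timedelta

# Sketch: each dataset carries an expected next-update date, and a
# periodic job asks which datasets are due for re-ingestion.
# Frequencies and field names are made up for illustration.
FREQUENCY_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}

def due_datasets(datasets, today):
    """Return ids of datasets whose expected update date has arrived."""
    return [d["id"] for d in datasets if d["next_update"] <= today]

def schedule_next(dataset, ingested_on):
    """After ingestion, push the expected update date forward."""
    dataset["next_update"] = ingested_on + timedelta(
        days=FREQUENCY_DAYS[dataset["frequency"]])
    return dataset
```

Each completed ingestion reschedules the dataset, so the workflow restarts itself as soon as the next source update is expected.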

Here’s our standard data workflow process:

Our Coverage

Our team curates Knoema’s ever-evolving data collection to help capture emerging social, economic, financial, political, and industry-specific topics and trends.

Areas of coverage include:

  • Countries and regions. Socio-economic datasets at both the national and sub-national level for more than 150 countries, including 300 million data points for commodities trade.
  • Cities. We have data from 529 metropolitan areas—including 63 major global cities—with historical series (including nation- and topic-specific data, such as housing prices in the 70 largest cities of China).
  • Outlook. Economic forecasts from authoritative agencies, including 70 international and regional organizations (e.g., IMF, OECD, Eurostat, the United Nations, and development banks).
  • Location-based data. Geolocation tags within datasets help users match data to precise locations and put it in the context of other relevant data points (e.g., daily weather data for all weather stations in the world, 10,000+ power plants, and 2,000+ mineral facilities).
  • Industry data. Industry-specific inputs arranged by vertical, including natural resources, arms sales, telecom, and consumer spending. See figure below.
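The location-matching mentioned above often comes down to a nearest-point lookup. A minimal sketch using the haversine great-circle distance; the station list is invented for illustration:

```python
from math import asin, cos, radians, sin, sqrt

# Sketch: match a query point to its nearest weather station using the
# haversine great-circle distance (Earth radius ~6371 km).

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def nearest_station(lat, lon, stations):
    """stations: list of (name, lat, lon) tuples."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s[1], s[2]))

stations = [("Berlin-Tegel", 52.55, 13.29), ("Paris-Orly", 48.72, 2.38)]
nearest = nearest_station(52.52, 13.40, stations)  # Berlin-Tegel
```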

That’s nowhere near everything. For an idea of our range, here’s a small selection of topics we published just over the last few days: Polish retail sales; Thai livestock statistics; global happiness rankings; Australian demographics; EU cheese production; Saudi electricity consumption; and Indian leopard fatalities.

Time series data is vitally important to decision making, but access to data isn’t the end of the story.

Looking Ahead: Time Series Data and Artificial Intelligence

We’ve already told you that time series data is vitally important to decision making, but access to data isn’t the end of the story. Several steps precede it, and at the volumes of data we handle today, we’ve become reliant on specialized digital tools.

Knoema’s customers range from students to industry analysts, scientists, and journalists, each of whom has unique demands. Knoema’s platform therefore not only offers users the largest time series data repository available but also makes it usable with online data tools such as advanced search, interactive dashboards, maps, presentations, correlation analysis, and scenario simulation.

But thanks to emerging technologies such as artificial intelligence and machine learning, data is no longer limited to standard workflows and methods. One of the new trends we’re following is the closer integration between text and data. For instance, we can automatically create full analytical reports based only on a set of chosen indicators. Or we can create living reports, so that instead of PDFs being accompanied by an Excel file, now text is accompanied by interactive live charts. We can also link text and data in real time, which saves time when it comes to analytics.

Knoema continually uncovers new benefits from new technologies, and we always seek out new applications for time series data. We’ll soon share the first results of our AI and machine-learning projects with our users.

Want to learn more about our data?
