Thursday, December 06, 2007

1010data Offers A Powerful Columnar Database

Back in October I wrote here about the resurgent interest in alternatives to standard relational databases for analytical applications. Vendors on my list included Alterian, BlueVenn (formerly SmartFocus), Vertica and QD Technology. Most use some form of a columnar structure, meaning data is stored so the system can load only the columns required for a particular query. This reduces the total amount of data read from disk and therefore improves performance. Since a typical analytical query might read only a half-dozen columns out of hundreds or even thousands available, the savings can be tremendous.

I recently learned about another columnar database, Tenbase from 1010data. Tenbase, introduced in 2000, turns out to be a worthy alternative to better-known columnar products.

Like other columnar systems, Tenbase is fast: an initial query against a 4.3 billion row, 305 gigabyte table came back in about 12 seconds. Subsequent queries against the results were virtually instantaneous, because they were limited to the selected data and that data had been moved into memory. Although in-memory queries will always be faster, Tenbase says reading from disk takes only three times as long, which is a very low ratio. The reflects a focused effort by the company to make disk access as quick as possible.

What’s particularly intriguing is Tenbase achieves this performance without compressing, aggregating or restructuring the input. Although indexes are used in some situations, queries generally read the actual data. Even with indexes, the Tenbase files usually occupy about the same amount of space as the input. This factor varies widely among columnar databases, which sometimes expand file size significantly and sometimes compress it. Tenbase also handles very large data sets: the largest in production is nearly 60 billion rows and 4.6 terabytes. Fast response on such large installations is maintained by adding servers that process queries in parallel. Each server contains a complete copy of the full data set.

Tenbase can import data from text files or connect directly to multi-table relational databases. Load speed is about 30 million observations per minute for fixed width data. Depending on the details, this comes to around 10 gigabytes per hour. Time for incremental loads, which add new data to an existing table, is determined only by the volume of the new data. Some columnar databases essentially reload the entire file during an ‘incremental’ update.

Regardless of the physical organization, Tenbase presents loaded data as if it were in the tables of a conventional relational database. Multiple tables can be linked on different keys and queried. This contrasts with some columnar systems that require all tables be linked on the same key, such as a customer ID.

Tenbase has an ODBC connector that lets it accept standard SQL queries. Results come back as quickly as queries in the system’s own query language. This is also special: some columnar systems run SQL queries much more slowly or won’t accept them at all. The Tenbase developers demonstrated this feature by querying a 500 million row database through Microsoft Access, which feels a little like opening the door to a closet and finding yourself in the Sahara desert.

Tenbase’s own query language is considerably more powerful than SQL. It gives users advanced functions for time-series analysis, which actually allows many types of comparisons between rows in the data set. It also contains a variety of statistical, aggregation and calculation functions. It’s still set-based rather than a procedural programming language, so it doesn't support features like if/then loops. This is one area where some other columnar databases may have an edge.

The Tenbase query interface is rather plain but it does let users pick the columns and values to select by, and the columns and summary types to include in the result. Users can also specify a reference column for calculations such as weighted averages. Results can be viewed as tables, charts or cross tabs (limited to one value per intersection), which can themselves be queried. Outputs can be exported in Excel, PDF, XML, text or CSV formats. The interface also lets users create calculated columns and define links among tables.

Under the covers, the Tenbase interface automatically creates XML statements written to the Tenbase API. Users can view and edit the XML or write their own statements from scratch. This lets them create alternate interfaces for special purposes or simply to improve the esthetics. Queries built in Tenbase can be saved and rerun either in their original form or with options for users to enter new values at run time. The latter feature gives a simple way to build query applications for casual users.

The user interface is browser-based, so no desktop client is needed. Perhaps I'm easily impressed, but I like that the browser back button actually works. This is often not the case in such systems. Performance depends on the amount of data and query complexity but it scales with the number of servers, so even very demanding queries against huge databases can be returned in a few minutes with the right hardware. The servers themselves are commodity Windows PCs. Simple queries generally come back in seconds.

Tenbase clients pay for the system on a monthly basis. Fees are based primarily on the number of servers, which is determined by the number of users, amount of data, types of queries, and other details. The company does not publish its pricing but the figures it mentioned seemed competitive. The servers can reside at 1010data or at the client, although the 1010data will manage them either way. Users can load data themselves no matter where the server is located.

Most Tenbase clients are in the securities industry, where the product is used for complex analytics. The company has recently added several customers in retail, consumer goods and health care. There are about 45 active Tenbase installations, including the New York Stock Exchange, Proctor & Gamble and Pathmark Stores.

No comments: