CONTEXTUAL DATA PROCESSING

Victor E. Hansen, MKS Inc.

 

"It ought to be remembered that there is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things. Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new."
- Niccolo Machiavelli, The Prince,   1513

Abstract

Advances in information technology are flooding us with data. Current data models are no longer adequate for obtaining the information we need from this data. We propose that making information systems more human-like will make interacting with humans more effective. Numerous ways in which humans and information systems differ are discussed. V4, a contextual multidimensional data model, is introduced followed by examples of V4's use in several problem domains.

Introduction

Technology is allowing us to store ever greater amounts of data. Not only can we store more data but more types of data. Business data, sales data from grocery scanners, music in the form of CDs, pictures, sounds, books, even television is going digital with the acceptance of the new standards and HDTV. All of this is becoming more and more available through the World Wide Web and the Internet. In some instances we can actually get at the data and answers we seek, but as the quantities and types of data increase, getting answers becomes difficult to impossible.

Solutions to the data problem will come about only through novel ideas and perspectives. Our premise is that to make data more accessible we must first enable it with human-like qualities. This calls for new paradigms in both the organization and manipulation of data.

To this end we have developed a theoretical model for describing and accessing data. The model incorporates several new ideas, which vastly simplify the problems associated with describing and analyzing large databases. The two basic principles of this model are context and multidimensionality- two qualities inherent in the way people think. We have developed a product, V4, based on these concepts and have been able to use it to elegantly solve many difficult problems.

This white paper describes the ideas and characteristics of V4. First, however, we look at ways that people view data (or knowledge) and how these views differ from the operating characteristics of database systems. Next we examine some of the characteristics of V4 to show how it can enable users to better access data by modeling and manipulating the data in a form that is much closer to how people think about data. Problem domains well suited for V4 are reviewed. The paper closes with an update of V4's implementation status and current applications.

How People Differ from Database Systems

We make no claim as to knowing how people actually think. We can be fairly safe in saying that people can and do represent, store, and think about data in ways that no commercial database product can begin to do. Some of the major ways in which people think and react differently from database systems are discussed in the following section.

Contextual

We believe that the biggest difference between people's memory and databases is the notion of context. People interpret everything within a context. Words, phrases, actions, intentions are all "evaluated" (and colored) by the who, what, where, and when. Not only do external events and situations affect context but people's internal ideas and moods also exert a great influence on how everything is interpreted.

Context allows us to easily do things that are very difficult to do with computer systems. People apply the same general rules in different contexts. People pretend and play make-believe. People communicate the same information in different forms for different contexts (e.g. explaining a computer to a child versus an adult).

What versus How

Rarely when a person wants an answer to a question do they want to know how the answer was obtained ("Tell me the time, not how to make a watch"). This is usually not the case with requests to an information system. Details about the structure of the databases, how they interrelate, and specific field names are all necessary to obtain answers.

Generalize with Exceptions

A typical pattern of learning is to first use a general rule ("i before e") and then learn the exceptions ("except after c"). People are very good at this; to the point that they can recite the general rules but not be able to tell you the exceptions until an exception happens. Databases are not good at dealing with this type of situation. Databases store data, not general rules.

Facts, Rules, Procedures

If you are talking to someone on the phone and he asks you for your home phone number you can recite it right off. If he asks for the number of the local pizza parlor, which you don't know, but have right by the phone, you can recite it back to him. The person on the phone would think that you knew both numbers. The reality is that you only knew the first but could calculate the number for the pizza quickly enough so that the person on the phone could not tell.

Knowing how to get the answer is almost as good as knowing the answer. The way we store information in our heads makes the boundary between knowing and knowing how seamless.

Facts About Themselves (Metadata)

People know facts and lots of stuff about the facts. Most people can find a phone number in a phone book. They can also tell you about the phone book, how it is organized, and what the different codes and abbreviations mean. This type of knowledge is known as metaknowledge, or when applied to data in information systems, metadata. Newer information systems incorporate metadata but usually as a separate component, i.e. data and metadata are rarely intermixed.

Multidimensional

We have the ability to store knowledge by varying parameters or dimensions. Ask a person if a TV show is any good and you might get a response like, "The last show was pretty bad, but in general it's OK. My wife likes it, but the kids don't." In this response we have information about the show's quality by time, by specific show versus most shows, and by individuals. How would you represent this in a database?

The real world is varyingly multidimensional. By that we mean that not only is it multidimensional but that the number of dimensions associated with a data element varies. How many factors or dimensions go into the price of an airline ticket these days, and which factors apply when you purchase a ticket for business and which apply when you purchase a ticket for a family vacation?

Relational databases are essentially three dimensional- tables, rows and columns. Additional dimensions are created by duplicating tables, rows, or columns. The techniques and commands to access data vary depending on how the data is embedded in a particular row, column, and table. Determining how to represent all the dimensions associated with data is one of the major challenges for a database administrator/designer. The task of adding dimensions once a system has been up and running can be next to impossible. For example, if a pricing database has been designed with customer, date, item, and quantity as the components (or dimensions) of the price of items and now the sales rep commission must be added as an additional component. What databases have to change, what procedures, what reports?

People Forget

"Use it or lose it" as the saying goes. People forget facts that are not used. How much of your grade school, high school, and college education do you really remember? Databases do not forget yet studies have shown that data quality suffers greatly over time if the data is not used. This raises interesting issues for the design and long term use of data warehousing.

Summary

Why bother with these differences? Because incorporating some of our human abilities into information systems would result in many benefits. A very brief summary is given below.

  • Contextual -; A contextual system would be able to interpret requests and data according to the requestor (for example year-to-date might mean seasonal to a marketing manager, and fiscal to a financial person). Answers would be given in a format appropriate to the requestor (different languages, different levels of detail according to the expertise of the requestor). Context would also allow users to set up multiple what-if or pretend scenarios to examine and manipulate data and models in different situations (What if an x% decrease in price resulted in a y% increase in sales? How would that affect the bottom line, salesrep commissions, customer volume discounts?)
  • What versus How -; Being able to request what data you want without having to specify how to get it removes the underlying database structure from the users. People should not have to know the gruesome details of a data warehouse in order to access the data within it.
  • Generalize with Exceptions -; The ability to store general rules and exceptions would make many databases much smaller, simpler, and easier to understand. For instance, a pricing database may have thousands of records (one for each inventory item), when in fact, the prices could have been represented with something as simple as "all prices are 20% over cost except item xyz which is $4.98 and abc which is $10.40."
  • Facts, Rules, Procedures -; An information system must be able to manipulate facts, rules, and procedures in a seamless fashion in order to provide the high level of integration needed to answer real-world inquires. It is no longer sufficient to simply report back data or sums of data; financial, statistical, and industry specific models and massaging must be performed.
  • Metadata -; An information system that "knew" all about itself, when coupled with the ability to incorporate procedures, should be able to automatically generate low level database queries in response to a high level user request (i.e. allow the user to ask what, not specify how).
  • Multidimensional -; An information system designed around a multidimensional paradigm would allow end-users much greater freedom to access and interrelate data. Users would not be tied to the relational table-row-column structure of the underlying data.
  • Forgetting -; The point here is not to have information systems be able to "forget" but to ensure that the data residing in the system is accessed in a timely manner to ensure its quality.

The V4 Model

The V4 model is a new paradigm for the representation of information systems. V4 views the world through a contextual multidimensional model. It is not based on the concepts of tables, rows, and columns found in relational systems. It is not based on artificial distinctions between data, metadata, and procedures. V4 is a complete information system model that enables you to integrate data from multiple sources, manipulate this data, and report on it- all in a contextual framework. The remainder of this section gives a brief overview of the important V4 concepts.

Multidimensional

V4 is based on the concepts of dimensions and points contained within a dimension. In V4, a dimension represents a concept and points along that dimensional line represent instances of that concept. For example an Integer dimension would have points representing the integer numbers. A color dimension would have points representing the different colors. Dimensions can be used to represent any concept so we can have a customer dimension where each point represents a customer. Or a flight dimension where each point represents a scheduled airline flight. At a more abstract level, we may also have a string dimension where each point represents a text string or a C-Program dimension where each point represents a C language program.

Points, unto themselves, do not carry any information. Information is represented as the combination or intersection of points. For instance, the point representing inventory item xyz, by itself, has no information content. Nor does the point representing the concept of "price of". But if we put them together, or intersect them in the multidimensional space, then we get another point representing "the price of item xyz" which we can assign or bind a dollar value which is itself just another point. In V4 we bind value points to the intersection of other points. Facts are represented by binding values to the intersections of specific points as in "the price of item xyz is $9.99". Rules are represented by binding values to the intersections of point ranges like "the price of all items is 20% over cost". Remember that procedures are points so the formula "20% over cost" is another point. V4 allows us to have specific facts and rules, which may overlap as in the two examples just cited. If we were to ask for the price of an item, V4 would see that there were two possible intersections and take the best one. In this example the best one would result in $9.99 for item xyz and "20% over cost" for all others.

V4 uses sophisticated rules to determine which of several possible intersections is the best match. In this pricing example intersections such as "the price of item xyz", "the price of all items", "the price of item xyz on 10-Jan-98" and "the price of all items sold to customer abc during 1998" can all coexist and result in different answers. The users and designers of V4 applications do not have to be concerned with mechanisms underlying this representation -; it is all performed by the V4 runtime system.

Contextual

V4's contextual nature works hand-in-hand with the multidimensional model. The V4 context is simply a collection of points maintained as an independent set. V4 uses the context to determine the best answer when presented with a query. Again, using the above example, if the query were stated as "the price of item xyz?", V4 would find the best intersection of the point for "price of" with the point for item xyz plus any points in the context. If the context contained the point "10-Jan-98", even though the query only specified "the price of item xyz", the result would be the intersection of "the price of item xyz on 10-Jan-98". This is because, in the current context, it is the best match. Similarly, if the context also included the point for customer abc then the best intersection for the query would be "the price of all items sold to customer abc during 1998".

The power of this approach lies in the ability to design implicit V4 procedures, which become explicit when executed or evaluated in a particular context. This means that generic rules can be specified once and then used in many different contexts.

Context is used to maintain two types of points. The first, as demonstrated in the above examples, is to use context to reflect the current state. Context points of this type include the date, the user, the type of request, or anything else reflecting the current environment and query.

The second class of points is for declaring a pretend or what-if scenario. By adding contextual scenario points we can say "What if …?" or "Let's assume that …". Expanding on the pricing examples above, we could make the binding "In scenario A, the price of item xyz for customer bcd is $7.99". Evaluating the query "price of item xyz" in the context of customer bcd would normally result in $9.99. However if "Scenario A" were also in the context then the "price of item xyz" would result in "$7.99". Context can be used for more interesting examples- "In scenario B, the price of xyz is 10% less than normal". Here, "normal" refers to whatever the result would have been without scenario B.

V4 Primitives

Many standard routines found in statistical, financial, and general programming environments are provided by the V4 environment in the form of predefined primitives.

  • Time -; Time can be modeled in many different ways with many different dimensions. Primitives are provided to quickly and accurately convert among the many different modes of time.
  • Statistical -; All of the standard distributions, inverses, functions, and tests are available directly within V4. These functions combined with V4's ability to manipulate time make it an extremely powerful statistical analyzer.
  • Financial -; Many high level financial routines are provided including those for the time-value-of-money, coupons, securities, and treasury bills.
  • List Processing -; Lists are an important construct within V4. Primitives are provided to enumerate through lists (e.g. all points on a dimension), to sort lists, to select subsets of lists, and to merge lists. "Lazy lists" can be defined to efficiently enumerate through extremely large lists.
  • Database -; Large data sets can be handled in several ways within V4. For optimum speed, the data can be loaded directly into V4. Where there is simply too much information or the up-to-date accuracy of the data is critical, V4 can access data through ODBC interface primitives. Non-ODBC databases can be access through a text table handler.
  • Dimensional -; Much of the data to be analyzed by V4 is not in a multidimensional format. V4 primitives can be used to quickly scan flat data (upwards of 20,000 rows per second) to translate it temporarily or permanently into a multidimensional form.
  • V4 -; Primitives within V4 can be used to construct new V4 dimensions, points, intersections, and bindings. This enables V4 to construct new rules, as it needs them. For example, metadata information can be used to convert a high-level user request into the specific rules required to access, aggregate, analyze, and report back the answers.

Problem Domains for V4

Very little research has been done in contextual systems although recent interest seems to be growing. Context can be used in many different problem areas. Robotics, reasoning systems, expert systems, natural language processing, are all prime candidates for contextual models. Below we will limit the discussion to the use of context in business database management. Note that all of the examples given below were generated from within Microsoft Excel using a V4 add-in.

Financials

V4 has many characteristics making it ideal for financial analysis and reporting. Dimensions can be defined for company, department, profit center, general ledger account, forecast, budget, currency, etc. Reports for any combination of dimensions can be easily generated. New budgets can be quickly formulated with general rules and filled in by department heads with specific facts. Different revisions can be maintained. Models (see below) can be developed and then run on any subset of the data. Econometric data from other sources can be compared and V4's financial primitives can be used to compare and optimize.

Monthly Actuals for Year 1996 Department 2292

Account

Description

January
February
March

6100

SALARIES - OFFICE

20,941

23,768

25,241

 

 

38,300

38,300

38,300

6120

EMPLOYEE BENEFITS EXPENSE

5,444

6,586

6,941

 

 

10,500

10,500

10,500

6330

PROMOTIONAL EXPENSE

426

0

0

 

 

0

0

0

6335

TRAVEL EXPENSE

1,096

9,382

5,092

 

 

2,000

2,000

2,000


The above report shows actual versus budget for department "2292". The first line for each expense account is the actual for the period. The line below is the corresponding budget amount. If the actual expense exceeds the budget the actual amount is highlighted in red. The report (with formatting) was generated with less than ten V4 rules.

Sales Analysis

Sales data can be captured in flat files (as is typical in relational transaction processors) and then viewed multidimensionally within V4. Sales by any time component, geographic area, sales territory, inventory attribute, vendor, or customer data can be easily and quickly generated. Statistical routines can be run to project future demand. Results of any analysis are immediately available for additional financial reporting or manufacturing modeling. Sales figures can be altered within what-if scenarios giving executives the opportunity to view possible management decision outcomes.

 

1996Q1

1996Q1-$

1996Q2

1996Q2-$

1996Q3

1996Q3-$

1996Q4

1996Q4-$

ANGEL

JOHCH

62,826

JOHCH

65,190

STRON

65,966

JOHCH

65,763

 

HIGHA

24,470

MELIJR

47,393

MENAR

20,610

GRAYR

51,499

 

POLLO

17,482

WEIDA

17,103

POLLO

12,893

MELIJR

45,819

 

 

 

 

 

 

 

 

 

BARRJ

SHAPI

48,018

REFFI

32,722

REFFI

6,289

REFFI

12,676

 

REFFI

29,557

SHAPI

13,369

SHAPI

5,471

SHAPI

3,382

 

 

 

 

 

 

 

 

 

BROWL

RUCKS

98,130

MAHEM

52,392

METZL

31,577

METZL

33,945

 

MAHEM

39,779

TUTTLG

33,000

GRIEM

23,807

RALPH

30,858

 

HUCKE

35,306

FUHR

30,930

TUTTLG

19,516

WEEKS

30,387


The above report is an excerpt of a "top sales rep by manager" report. It lists for each sales manager, the top three sales reps and sales for each of the four quarters in 1996. The top rep for each quarter (across all managers) is highlighted in red italics. This report was generated from "flat" sales order detail data with less than ten V4 rules (five rules for the data, four for formatting).

V4 typically has low level transaction data available. This can often be used for dynamic drill-down. High level results can be quickly displayed followed with more specific drill-downs for supporting detail.

Data Quality Models and Assurance

As it becomes easier and cheaper to collect more and more data, the tendency to "grab it now because we may need it later" becomes more prevalent. The less this data is used, the more dubious its value and quality. V4 can be utilized to summarize, correlate, tabulate, and point out trends in large amounts of data helping to ensure its quality. Metadata can be used to ensure that interrelationships are meaningful and that all collected data in a warehouse is being reviewed in some fashion.

Metadata Repository

Metadata is literally "data about data". It is more than just the mechanics of a database; it describes what we know about the components of the database. What one person knows about something might not be exactly the same as what another person knows. Necessary requirements for a state-of-the-art metadata repository include being able to represent- different points of view about a database, how the database has changed over time, and how different systems reference the data. A contextual multidimensional space can incorporate these different points of view and reference them accordingly.

Given a metadata repository, the next step is using that information in assisting users. For instance, using the information to navigate a complex data warehouse. Again, a multidimensional model with the ability to incorporate procedures easily allows this.

Security Models

Some IS managers view database security as a subset of a metadata repository, others think it deserves its own classification. In either case, security is about allowing or denying access to data in various contexts. The V4 model allows the security administrator to initially set up general access rules and then add more specific access rules as situations warrant. The general rules can either permit access (with subsequent rules denying as needed) or deny access (with subsequent rules allowing as needed). In either case new contexts can be added with time without having to upset any existing security policies.

Manufacturing Models

Complex models can be easily developed, maintained, and evaluated with V4. Rules can be changed with a simple change of context. Data from a variety of sources can be run through the model again by a simple change of context. The example below shows the Excel output of a production-scheduling model. The input data for this spreadsheet was taken from a production schedule matrix. Raw materials were exploded through a bill of materials file. Finally actual onhand and expected purchase order receipts were merged. The entire model, explosion, aggregation, and exporting to Excel was done with less than 15 V4 rules (eight to perform all of the data manipulation from flat files, six to format for Excel).

 

 

 

 

Needed

Needed

Short

Short

Id

Description

Receipts
OnHand
2/2/98
2/9/98
2/9/98
2/9/98

SKU10067I

CANDY, SUGAR FREE HARD BULK

0

4,666

0

0

 

 

SKU10250D

SHRINK BANDS

20,000

0

0

37,080

 

37,080

 

PO# 72679 on 05-Feb-98 for 15000

 

 

 

 

 

 

 

PO# 72730 on 15-Dec-98 for 5000

 

 

 

 

 

 

SKU10258G

LBL-HADD PNUT CRUNCH 6 OZ

0

0

12,360

37,080

12,360

37,080

SKU10264I

FLAVOR BUTTER PLUS EMULSION

0

0

0

0

 

 

SKU10266I

CANDY GOLD FOIL CROQUETTES

0

1,333

50

60

 

 

SKU10278I

NUTS, FANCY MED PECAN PCS

0

994

0

1,896

 

902

SKU10316C

15 CAV OLD FASH CRML

25,000

6,729

0

0

 

 


The color-coding indicates problem areas. Red items indicate that onhand quantities do not cover production needs. Green items indicate that onhand quantities are insufficient but that expected receipts should cover the shortfall.

Inter-Systems Data Flow

V4's contextual nature makes it an excellent choice for integrating and coordinating the data flow among different (legacy) systems. V4's powerful data manipulation capabilities and internal primitives for translating data types simplify the task of implementation.

The above flow chart represents two aviation systems. Each references an altitude. For system A altitude may be defined as feet above sea level, for system B it may be defined as meters above ground level. V4 can reside between these two systems and translate altitude to the value appropriate to each system's context.

Conclusions

What V4 Is and Is Not

V4 is not a canned package. It does not provide its own graphical user interface. It is not a tool to be given directly to end-users. V4 is a powerful new tool for maintaining, analyzing, manipulating, and reporting data. V4 is to be used behind the scenes in one of the following ways-

  • As one of a set of tools for a turnkey solution. V4 integrates well with relational data and can be integrated with many different front and back-end products.
  • As a background data processor for Excel. An Excel add-in has been developed giving Excel users the ability to directly access V4. Macros can be used to guide users through any necessary setups. Specific routines can be easily incorporated into the Excel menu structure. Users can perform what-if analysis and drill-down operations directly from within Excel.
  • As a semi-natural language processor of user requests. Data, metadata, and procedures can all be incorporated into V4. High level user requests can be converted directly into low level V4 rules (commands) with relatively simple generic V4 rules. Flexible ad-hoc access to complex data warehouses can be easily accomplished.

Implementation Status

V4 has been under development for approximately seven years. Patent applications were applied for key aspects of the technology in 1993 and have been recently approved. Currently, V4 runs on most major Unix platforms, VMS, and WindowsNT. An add-on dynamic link library has been developed for Excel and is used for interactive access. V4 has been used in a variety of applications ranging from profit center analysis, to budgeting, to sales forecasting. Current models are in the hundreds of millions data point range.

Performance is measured two ways. The first is the size of the data models. The models are typically loaded with non-aggregated subsets of a transactional database. A load of raw data would typically result in a data model 80-90 percent of the size of the original data. V4 is an interpreted system with the basic operation being the evaluation of an intersection. Peak intersection rates on a 200+ megahertz Pentium-class processor can reach upwards of 100,000 intersections per CPU second. Performance is also based on the data and level of pre-aggregation. Scan rates of flat data have been clocked at over 20,000 rows per CPU-second, or 1.2 million rows per minute.

Summary

Contextual database systems are an important new concept in information systems. V4 is currently the only package offering both multidimensionality and contextual evaluation. Only V4 can offer all of the following:

  • Manipulation of concepts and points, not relations and rows;
  • Representation of rules and facts in a single framework;
  • Ability to store low-level transactional data and master file data as well as aggregated data;
  • Incorporation of calculations/procedures as points;
  • Use of context to allow different meanings to the same "words";
  • Use of context for sophisticated "what-if" analysis;
  • Modeling and representation of metadata within the same framework as the target data;

Ability to automatically generate low level queries from high level requests.


May/June 1997 Software Developer & Publisher 

 ©2011 MKS, Inc.

MKS, Inc. • 987 Old Eagle School Road • Wayne, PA 19087-1708
(610) 989-9905 • 888-PICK-MKS
sales@mksinc.com • webmaster@mksinc.com