When analysts are not certain which financial indicators should be
used in a new forecast model or trading system, they typically start by considering all forms of data that might have
some relation to the desired result. For example, when developing a model intended to forecast aluminum prices, one
might start by collecting historical metal prices as well as indicators related to future supply and demand. Such
indicators may include the consumer confidence index, projected car sales (autos require lots of aluminum), projected
oil prices (which may affect car sales), and related indices. This collection of market data time series can grow quite
large, often running to 50 or more series.
Should 50 or more variables be fed to a forecast model? As unintuitive as this may
appear, models with fewer input variables frequently outperform those utilizing more variables! Statisticians
attribute this counter-intuitive behavior to two well-understood phenomena:
OVER-FITTING . . . too many variables allow the model/system to fit the
random as well as the non-random aspects of market behavior. The resulting model/system will then try to forecast
random as well as non-random market movement. But, like forecasting the roll of dice, this will probably lead to
disastrous results.
MULTICOLLINEARITY . . . some input variables may carry essentially the same
information, with only slight differences between them. Pairs of such indicators are said to be correlated.
Models that are "trained" stochastically may come to depend on these correlations, and the consequences can be
catastrophic when the correlations shift even slightly over time.
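The futures/spot relationship discussed later in this article is a natural illustration of multicollinearity. Here is a minimal sketch using synthetic data (the series and the size of the premium are invented for illustration) that shows how two such indicators carry almost identical information:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic illustration: a "spot price" random walk and a "futures price"
# that tracks it closely, offset by a small premium plus noise.
spot = 100 + np.cumsum(rng.normal(0, 1, 500))
futures = spot + 2.0 + rng.normal(0, 0.3, 500)

# The two series are nearly redundant as model inputs:
r = np.corrcoef(spot, futures)[0, 1]
print(round(r, 3))  # correlation very close to 1.0
```

A model trained on both columns can latch onto their near-perfect correlation; if the premium later widens, that learned relationship silently breaks.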
The model/system builder can address both phenomena by reducing the number of
inputs to a bare minimum. But if you were to start with just 10 indicators, you would need to consider over 1,000
possible combinations of variables! If you were to start with 100 indicators, there would be about 2^100 (roughly 10^30)
combinations, and your computer would spend the next 13 eons considering them all! Civilization may not be around by
then.
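The arithmetic behind those figures is simply the count of non-empty subsets of N candidate indicators, 2^N - 1:

```python
# Exhaustive variable selection must consider every non-empty subset
# of the N candidate indicators: 2**N - 1 combinations in total.
def subset_count(n: int) -> int:
    return 2**n - 1

print(subset_count(10))   # 1023, i.e. "over 1,000"
print(subset_count(100))  # about 1.27e30, far beyond exhaustive search
```

The count doubles with every added indicator, which is why brute-force subset search stops being feasible long before 100 variables.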
Some modeling tools try to get around this problem by performing correlation
analysis between pairs of input variables and deleting one indicator from each highly correlated pair. This approach is
faulty, as it may delete one or more critical indicators; for some problems, two correlated inputs are exactly
what the model needs! For example, one model may need only the futures contract price or the spot price as input,
because the two are highly correlated and either one is sufficient. Another model, however, may require
both inputs because it needs to know the difference, or premium, between them.
Is there an efficient way to decorrelate input data and reduce
the number of input variables simultaneously, without throwing away potentially useful information?
Professional forecasters use software based on solid techniques that large
companies can afford to buy. This puts individual traders at a disadvantage. Until now, that is. DDR, the Decorrelator
and Dimension Reducer, gives you access to the same powerful technique used by professional data analysts.
To reduce the number of variables for your model (regression, neural net, etc.),
DDR does not eliminate any of your input variables. Dimension reduction is attained in a completely different
way.
Instead, DDR processes all N input fields (explanatory variables, such as price,
MACD, interest rates, etc.) and converts them into N new explanatory variables (e.g., N new columns on a spreadsheet).
These new indicators, in total, represent all the information found in your original columns; however, DDR concentrates
most of the information into just a few of these new variables.
DDR also ranks all the new indicators it produces according to how much of the
original information each new indicator represents. This way, you can eliminate most of the other new variables without
losing much information, if any at all. As a result, you have the same information as before, but now it is represented
by fewer variables. On a typical data set, DDR reduces the number of explanatory variables by at least 50% without
losing more than 1% of the total information supplied by all the original indicators. That's dimension
reduction!
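DDR's internal algorithm is not published, but the behavior described above (transforming N columns into N new ranked columns, then keeping the few that concentrate most of the information) matches principal component analysis (PCA). As a hedged sketch, here is a minimal numpy-only PCA reduction; the function name `pca_reduce` and the synthetic data are mine, not part of DDR:

```python
import numpy as np

def pca_reduce(X, keep_fraction=0.99):
    """Project N columns onto the fewest principal components that
    retain `keep_fraction` of the total variance ("information")."""
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # rank components: largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, keep_fraction) + 1)
    return Xc @ eigvecs[:, :k], explained[:k]

# Example: 5 correlated columns driven by only 2 independent sources.
rng = np.random.default_rng(1)
base = rng.normal(size=(400, 2))
X = base @ rng.normal(size=(2, 5)) + rng.normal(0, 0.01, (400, 5))
Z, explained = pca_reduce(X, 0.99)
print(Z.shape[1])  # far fewer than 5 columns retain 99% of the variance
```

The cumulative `explained` array is the ranking the article describes: each component is scored by how much of the original variance it carries, so trailing components can be dropped with negligible information loss.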
As an added benefit, the new time series indicators produced by DDR are completely
decorrelated, normalized, and zero-mean. Just right for input to a non-linear regression model, such as a neural
network.
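Those three output properties are easy to verify numerically. In the PCA interpretation sketched above (an assumption about DDR, not its documented internals), rotating centered data onto the eigenvectors of its covariance matrix and dividing each new column by the square root of its eigenvalue yields columns that are zero-mean with identity covariance:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated columns

Xc = X - X.mean(axis=0)                                  # zero-mean
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = (Xc @ eigvecs) / np.sqrt(eigvals)                    # decorrelate, normalize

print(np.allclose(Z.mean(axis=0), 0))                    # True: zero-mean
print(np.allclose(np.cov(Z, rowvar=False), np.eye(4)))   # True: identity covariance
```

Identity covariance means every pairwise correlation is exactly zero and every column has unit variance, which is why such outputs feed cleanly into a neural network or other non-linear regression.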