We explore a stacked framework for learning to predict dependency structures for natural language sentences. A typical approach in
graph-based dependency parsing has been to assume a factorized model, where local features are used but a global function is optimized (McDonald et al., 2005b). Recently, Nivre and McDonald (2008) used the output of one dependency parser to provide features
for another. We show that this is an example of stacked learning, in which a second predictor is trained to improve the performance
of the first. Further, we argue that this technique is a novel way of approximating rich non-local features in the second parser, without sacrificing efficient, model-optimal prediction. Experiments on twelve languages show that stacking transition-based and graph-based parsers improves performance over existing state-of-the-art dependency parsers.