Dependency Management System with Hadoop Streaming for Data-analytic Projects
Reviewed, Featured
Lin Li, Sozo Inoue
Korea-Japan Joint Workshop on ICT
In this paper, we propose a distributed parallel processing system for data-analytic projects. The system manages dependencies among data and analytic programs, and when data or programs are updated it re-executes only the updated programs and the programs that depend on them. A data analyst can specify the dependencies and the parts that require distributed parallel processing with Hadoop Streaming; only the updated and dependent parts are then processed, with parallel or sequential execution selected flexibly. The specification can also express multiple executions of the same program on different data as a single simple statement, while the dependencies of each execution are checked separately.
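
The idea of re-executing only the updated and dependent parts can be illustrated with a minimal Python sketch. This is not the authors' system or its specification language; the function name `plan_reexecution`, the dictionary format, and the file names are assumptions made purely for illustration. It does not invoke Hadoop Streaming and only computes which outputs would need to be rebuilt, and in what order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_reexecution(deps, updated):
    """Return the outputs to rebuild, ordered so prerequisites come first.

    deps:    hypothetical format, maps each output to the data files and
             programs it is produced from.
    updated: set of data files or programs that have changed.
    """
    # Invert the mapping: item -> outputs that consume it directly.
    consumers = {}
    for target, inputs in deps.items():
        for item in inputs:
            consumers.setdefault(item, []).append(target)

    # Walk downstream from every updated item to collect stale outputs.
    stale = set()
    pending = list(updated)
    while pending:
        item = pending.pop()
        for target in consumers.get(item, []):
            if target not in stale:
                stale.add(target)
                pending.append(target)

    # Keep only stale outputs, in an order that respects the dependencies.
    order = TopologicalSorter(deps).static_order()
    return [node for node in order if node in stale]


# Illustrative project: file and program names are made up.
deps = {
    "clean.csv": ["raw.csv", "clean.py"],
    "features.csv": ["clean.csv", "features.py"],
    "report.txt": ["features.csv", "report.py"],
    "summary.txt": ["other.csv", "summary.py"],
}

# Editing features.py invalidates features.csv and, through it, report.txt.
print(plan_reexecution(deps, {"features.py"}))
```

In this sketch, `clean.csv` and `summary.txt` are left alone because nothing they depend on changed. Running the same program on several data sets would simply appear as several entries in `deps`, each checked on its own.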