Read at: http://www.readwriteweb.com/enterprise/2012/06/hadoop-needs-better-bridges-to-fulfill-big-data-promise.php?utm_source=ReadWriteWeb+Newsletters&utm_medium=email&utm_campaign=c8f7606152-RWWDailyNewsletter
Hadoop is designed to store big data cheaply on a distributed file
system across commodity servers. How you get that data there is your
problem. And it’s a surprisingly critical issue because Hadoop isn’t a
replacement for existing infrastructure, but rather a tool to augment
data management and storage capabilities. Data, therefore, will be continually flowing in and out.
Beyond Basic Tools
Basic tools exist, of course. Since Hadoop came into being, simple commands like Hadoop Copy have offered a straightforward, if slow, way to get data into Hadoop. And there is Apache Sqoop, built expressly for moving data between a relational database management system (RDBMS) and Hadoop.
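For concreteness, the two basic routes look roughly like the following sketch; the paths, hostnames, credentials, and table names are placeholders, not details drawn from the article.

    # Copy a local file into HDFS, and pull results back out, with the built-in file commands
    hadoop fs -put /tmp/events.csv /data/landing/events.csv
    hadoop fs -get /data/results/part-00000 /tmp/results.csv

    # Sqoop: import an RDBMS table into HDFS, then export a result table back out
    sqoop import --connect jdbc:mysql://dbhost/sales --username etl -P --table orders --target-dir /data/orders
    sqoop export --connect jdbc:mysql://dbhost/sales --username etl -P --table order_summary --export-dir /data/summary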
But Sqoop has limitations of its own. It works, but it relies on low-level MapReduce jobs to do the work, which adds complexity and, because MapReduce runs as batch jobs, time to every import and export. It might be possible, of course, to take that time and dump your data into Hadoop just once, but that assumes Hadoop will completely replace your data storage infrastructure.
This is the near-forgotten side of big data: placing Hadoop properly within existing infrastructure so that data is stored cheaply but remains quickly accessible for analysis. It is here that data integration tools must serve as the bridge between existing data stores, analytics and business intelligence tools on one side, and Hadoop on the other.
Pervasive Software
is a recent entrant to the Hadoop space, but not to the field of data
integration: The Pervasive Data Integrator is no stranger to those who
move in data circles. Earlier this month, the Austin-based company
announced a Hadoop edition
of its product that enables users to roll data from more than 200
sources into the Hadoop Distributed File System (HDFS) or HBase, the BigTable-style NoSQL database that runs atop Hadoop.
A Visual Approach
Unlike Sqoop, Pervasive uses a visual approach to integrating data.
“It’s a mapping problem,” said Pervasive CTO Mike Hoskins. He recounted how, even during development, one of Pervasive’s developers performed an off-the-cuff integration of 50,000 rows from an Oracle database into Hadoop in seconds… and that figure included the time it took to visually map the Oracle tables to Hadoop.
“He just mapped the tables, set the filters and constraints, set the target and clicked go,” Hoskins said.
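For contrast, the command-line route through Sqoop for a similar Oracle-to-Hadoop pull would look roughly like the following; the connect string, credentials, table, columns, and filter are hypothetical stand-ins for the scenario above.

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl -P \
      --table ORDERS \
      --columns "ORDER_ID,CUSTOMER_ID,TOTAL" \
      --where "ORDER_DATE >= '2012-01-01'" \
      --target-dir /data/orders \
      -m 4

Each flag corresponds to one of the steps Hoskins describes: the table mapping, the filters and constraints, and the target. The difference is that it is typed rather than drawn, and the work still runs as a MapReduce batch job.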
Hoskins has a vested interest in talking up Pervasive, of course, but his company’s product is part of a growing class of data integration software geared to work with Hadoop and its ecosystem of big data tools. Among these are Talend’s Open Studio and Enterprise Data Integration products, as well as Pentaho’s Kettle.
Data integration tools like these will make the transition to Hadoop much easier up front, and will also simplify extracting data for further analysis with tools outside Hadoop. They will be necessary if big data is to fulfill its promise of making it easier to understand the meanings and patterns hidden in complex information.