CDH Components (V4.5) and Talend Installation & Configuration Steps

Part I. Hadoop (single node mode)

1.      Unzip the hadoop package

 

2.      Edit HADOOP_HOME and JAVA_HOME in ~/.bash_profile, as sketched below.
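A minimal sketch of the ~/.bash_profile entries; the JDK and hadoop paths are assumptions for illustration, so adjust them to your actual install locations:

export JAVA_HOME=/usr/java/jdk1.6.0_45            # placeholder path
export HADOOP_HOME=/opt/hadoop-2.0.0-cdh4.5.0     # placeholder path
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

HIVE_HOME (Part II) and SQOOP_HOME (Part III) follow the same pattern.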

 

3.      Edit mapred-site.xml, core-site.xml and yarn-site.xml respectively in the ${HADOOP_HOME}/etc/hadoop directory

a.      Add the required properties into mapred-site.xml (see the sketches after this list)

 

b.     Add the namenode path and port to core-site.xml (see the sketches after this list)


c.      Add the nodemanager shuffle class and application lib path to yarn-site.xml (see the sketches after this list)

Note. In the Apache hadoop distribution the nodemanager uses 'mapreduce_shuffle' as the aux-services value instead of the 'mapreduce.shuffle' used in the CDH distribution.
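For reference, minimal sketches of the three files follow. The property names are the standard Hadoop 2.x / CDH4 ones, but the host and port values are placeholders to adapt, not the exact values from the original screenshots.

mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

core-site.xml:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hostname:8020</value>  <!-- placeholder host:port -->
</property>

yarn-site.xml:

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

The 'application lib path' in item c presumably maps to the yarn.application.classpath property, whose value lists the directories containing the hadoop and yarn jars.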

4.      Edit the hadoop running environment in hadoop-env.sh, configuring JAVA_HOME, HADOOP_YARN_HOME and HADOOP_CONF_DIR. Of course, you could also specify HADOOP_COMMON_HOME, HADOOP_MAPRED_HOME and HADOOP_HDFS_HOME. A sketch follows.
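A minimal sketch of the hadoop-env.sh entries; the JDK path is an assumption for illustration:

export JAVA_HOME=/usr/java/jdk1.6.0_45   # placeholder JDK path
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop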

 

5.      Format hdfs and start up the dfs and yarn daemons

./hdfs namenode -format

./start-dfs.sh   ./stop-dfs.sh

./start-yarn.sh   ./stop-yarn.sh

 

6.      By default, the hadoop yarn console can be accessed at http://hostname:8088/ and the dfs console at http://hostname:50070. These port numbers can also be set in yarn-site.xml and hdfs-site.xml respectively, as sketched below.
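A sketch of the relevant properties, assuming the standard Hadoop 2.x property names; the host values are placeholders:

In yarn-site.xml:

<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>hostname:8088</value>
</property>

In hdfs-site.xml:

<property>
  <name>dfs.namenode.http-address</name>
  <value>hostname:50070</value>
</property>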

Part II. Hive

1.      Unzip the Hive package, the same as the hadoop package.

 

2.      Configure HIVE_HOME in ~/.bash_profile, the same as for hadoop (see the sketch in Part I, step 2)

 

3.      Change hive.insert.into.multilevel.dirs to true; the default setting is false. If it is not changed, you will not be able to insert anything into a multi-level directory or a table stored under one. In the Apache distribution, however, the default value is true. A sketch of the setting follows.
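A sketch of the corresponding hive-site.xml entry:

<property>
  <name>hive.insert.into.multilevel.dirs</name>
  <value>true</value>
</property>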

 

4.      When using the Java API to transfer data from hive to mongoDB, we encountered an error where the application could not read the Hive configuration information. In that case you need to specify the properties HIVE_CONF_DIR and HIVE_AUX_JARS_PATH below in your hive-env.sh.

export HIVE_CONF_DIR=$HIVE_HOME/conf

export HIVE_AUX_JARS_PATH=$HIVE_HOME/lib


5.      Start up the hive server

There are three kinds of server that can be launched. One is command line mode, the second is thrift mode and the third is hwi (hive web interface) mode. If you want to connect to the hive server through a Java application or another client, you have to use thrift mode. You can access the web interface through http://hostname:9999/hwi; writing HiveQL and viewing results is pretty convenient within the Hive Web Interface.

 

Command Line Mode:

./hive

 

Thrift Mode:

./hive --service hiveserver



Hwi Mode:

./hive --service hwi
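If you need to change the hwi port or war file, the standard hive-site.xml properties below control them; the war file name depends on your Hive version, so treat the value as a placeholder:

<property>
  <name>hive.hwi.listen.port</name>
  <value>9999</value>
</property>
<property>
  <name>hive.hwi.war.file</name>
  <value>lib/hive-hwi-0.10.0.war</value>  <!-- placeholder version -->
</property>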

Part III. Sqoop

1.      Unzip the Sqoop package like the above steps

 

2.      Add SQOOP_HOME in ~/.bash_profile

 

3.      If the machine where Sqoop is to be installed already has a Tomcat installed, you should make sure no environment variables like CATALINA_HOME or CATALINA_BASE are set in ~/.bash_profile; they would result in a conflict.

Make sure those have been commented out, as below.
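For example, lines like these in ~/.bash_profile should be commented out (the paths are illustrative):

# export CATALINA_HOME=/opt/apache-tomcat-6.0.36
# export CATALINA_BASE=/opt/apache-tomcat-6.0.36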

 

4.      Edit the catalina.properties file in ${SQOOP_HOME}/server/conf, configuring common.loader to add the tomcat jars, hadoop jars and some other jars necessary for the Sqoop runtime.


Note. If your hadoop distribution does not contain guava.jar in its lib, you also need to add this jar to common.loader. Also, ojdbc6.jar is needed if transferring data between hdfs and Oracle is required. A sketch of the resulting line follows.
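A sketch of the common.loader value (written on one line in the real file); the ${catalina.*} entries are the Tomcat defaults, and the hadoop paths are placeholders for wherever your hadoop jars actually live:

common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,/opt/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/*.jar,/opt/hadoop-2.0.0-cdh4.5.0/share/hadoop/common/lib/*.jar,/opt/hadoop-2.0.0-cdh4.5.0/share/hadoop/hdfs/*.jar,/opt/hadoop-2.0.0-cdh4.5.0/share/hadoop/mapreduce/*.jar,/opt/hadoop-2.0.0-cdh4.5.0/share/hadoop/yarn/*.jar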

 

5.      Edit sqoop.properties in the same directory. Change the default log dir, namely replace every @LOG_DIR@ with the place where you want to put log files, as in the sketch below.
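One way to do the replacement in bulk, assuming /var/log/sqoop2 is the directory you chose (a hypothetical path):

sed -i 's|@LOG_DIR@|/var/log/sqoop2|g' ${SQOOP_HOME}/server/conf/sqoop.properties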

 

6.      Start up Sqoop.

${SQOOP_HOME}/bin/sqoop.sh server start

${SQOOP_HOME}/bin/sqoop.sh server stop

 

If the website http://hostname:12000/sqoop can be accessed successfully, then the Sqoop server has been launched.
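A quick check from the shell; any response other than a connection error means the server is up:

curl http://hostname:12000/sqoop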

 

7.      Command line client interactive mode

set server --host vm-9ac7-806d.apac.nsroot.net --port 12000 --webapp sqoop

create connection --cid 1

create job --xid 1 --type import // create job --xid 1 --type export

start job --jid 1

 

more:

update job --jid 1

delete job --jid 1

 

Part IV. Talend

1.      Download the Talend Open Studio package and unzip it. You will find the file hierarchy below.

Choose the launcher according to the Java version you are currently using; for example, since we use 32-bit Java, we choose TOS_BD_win32-x86.exe.

 

2.      Change the Java interpreter to a JDK instead of a JRE.

Go to Window -> Preferences -> Talend

 

3.      Some Talend components need extra jar files when you pull them out; just adding the jars to the Talend modules solves this problem.

 

 

Part V. Talend Examples

1.      Execute HiveQL in Hive, including a create table statement and a select query, for example:
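A hypothetical example of the kind of HiveQL the tHiveRow components below would run; the table name and columns are made up for illustration:

CREATE TABLE IF NOT EXISTS employee (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SELECT * FROM employee;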

 

 

 

 

Configuration:

·        tHiveConnection_1


·        tHiveRow_1


·        tHiveRow_2


2.      Load data from HDFS to an Oracle table.


·        tLibraryLoad_1


Add the third-party Oracle driver to this application.

·        tSqoopExport_1


Note. The Table Name field should be filled in as “SchemaName + TableName”; besides, the table name should be upper case.


3.      Programs Errors in Talend

 

a.      When loading data from HDFS to an Oracle table, a “NoSuchElement” error is encountered.

Solution:

The problem is possibly caused by the number of Oracle table fields not matching the number of fields in the file in HDFS. Checking the fields on both sides and making them match will solve this problem.

b.     Fields Format Error

Solution:

This kind of error mainly results from a wrong delimiter. Note that Sqoop decides which kind of data needs to be transformed from the Oracle table schema. A wrong delimiter splits records into the wrong sets of fields, so improper values are populated into the fields, which causes this error.

Make sure the Oracle table field types match the column types of the file to be exported in HDFS.
