Running Phoenix on CDH 5.4.5 to Import CSV Files

1. Install Phoenix

Configure the Phoenix parcel in the Cloudera Manager UI:

http://52.11.56.155:7180/cmf/settings?groupKey=config.scm.parcel.display_group&groupParent=

Add a Remote Parcel Repository URL: http://archive.cloudera.com/cloudera-labs/phoenix/parcels/1.0/

Cloudera Manager will automatically detect the new parcel. Click Download, then Distribute, then Activate, and restart the cluster.

2. Log in to one of the servers and check the Phoenix installation path

[root@ip-172-31-25-243 ~]# cd /opt/cloudera/parcels/CLABS_PHOENIX
[root@ip-172-31-25-243 phoenix]# ls
bin  dev  examples  lib  phoenix-4.3.0-clabs-phoenix-1.0.0-client.jar  phoenix-4.3.0-clabs-phoenix-1.0.0-server.jar  phoenix-4.3.0-clabs-phoenix-1.0.0-server-without-antlr.jar

The bin directory contains the executable scripts, and the examples directory contains sample files.
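
Besides psql.py, the bin directory also contains sqlline.py, Phoenix's interactive SQL client. A minimal sketch, assuming the same ZooKeeper host used later in this article (172.31.25.244):

bin/sqlline.py 172.31.25.244

At the sqlline prompt, !tables lists the Phoenix tables and !quit exits.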

3. Import a table from a CSV file

The CSV file is /root/ceb/cis_cust_imp_info.csv, with the following content:

20131131,100010001001,BR01,2000.01
20131131,100010001002,BR01,2000.02
20131131,100010001003,BR02,2000.03

Define a table-definition file /root/ceb/cis_cust_imp_info.sql with the following content:

CREATE TABLE IF NOT EXISTS cis_cust_imp_info(
statistics_dt varchar(50),
cust_id varchar(50),
open_org_id varchar(50),
assert9_bal decimal(18,2),
CONSTRAINT pk PRIMARY KEY(statistics_dt, cust_id)
); 

Note that the trailing semicolon is required.

Run the following command to import the CSV:

[root@ip-172-31-25-243 phoenix]# bin/psql.py 172.31.25.244 /root/ceb/cis_cust_imp_info.sql /root/ceb/cis_cust_imp_info.csv
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
15/09/04 10:26:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/04 10:27:00 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-phoenix.properties,hadoop-metrics2.properties
no rows upserted
Time: 0.259 sec(s)

csv columns from database.
CSV Upsert complete. 3 rows upserted
Time: 0.067 sec(s)
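
Here psql.py infers the target table from the CSV file name and maps CSV fields to columns positionally. When the file name does not match the table, or the columns differ, the table and column list can be passed explicitly; the flags below (-t for the table name, -h for the column list, -d for a non-comma delimiter) come from the psql usage text and are shown only as a sketch against a hypothetical input file:

bin/psql.py -t CIS_CUST_IMP_INFO -h STATISTICS_DT,CUST_ID,OPEN_ORG_ID,ASSERT9_BAL 172.31.25.244 /root/ceb/other_extract.csv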

Verify in the HBase shell:

hbase(main):001:0> list
TABLE                                                                                                                                                                                                  
CIS_CUST_IMP_INFO                                                                                                                                                                                      
SYSTEM.CATALOG                                                                                                                                                                                         
SYSTEM.SEQUENCE                                                                                                                                                                                        
SYSTEM.STATS                                                                                                                                                                                           
4 row(s) in 0.2650 seconds

=> ["CIS_CUST_IMP_INFO", "SYSTEM.CATALOG", "SYSTEM.SEQUENCE", "SYSTEM.STATS"]
hbase(main):002:0> scan 'CIS_CUST_IMP_INFO'
ROW                                                COLUMN+CELL
 20131131\x00100010001001                          column=0:ASSERT9_BAL, timestamp=1441362422661, value=\xC2\x15\x01\x02
 20131131\x00100010001001                          column=0:OPEN_ORG_ID, timestamp=1441362422661, value=BR01
 20131131\x00100010001001                          column=0:_0, timestamp=1441362422661, value=
 20131131\x00100010001002                          column=0:ASSERT9_BAL, timestamp=1441362422661, value=\xC2\x15\x01\x03
 20131131\x00100010001002                          column=0:OPEN_ORG_ID, timestamp=1441362422661, value=BR01
 20131131\x00100010001002                          column=0:_0, timestamp=1441362422661, value=
 20131131\x00100010001003                          column=0:ASSERT9_BAL, timestamp=1441362422661, value=\xC2\x15\x01\x04
 20131131\x00100010001003                          column=0:OPEN_ORG_ID, timestamp=1441362422661, value=BR02
 20131131\x00100010001003                          column=0:_0, timestamp=1441362422661, value=
3 row(s) in 0.1840 seconds
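
The raw scan looks cryptic because Phoenix encodes its data: the two primary-key columns (statistics_dt and cust_id) are concatenated into the HBase row key with a \x00 separator, the DECIMAL value is stored in Phoenix's internal binary format, and the empty _0 cell is the marker column Phoenix adds to every row. To see the decoded values, query through Phoenix instead of the HBase shell, for example with sqlline.py:

bin/sqlline.py 172.31.25.244
0: jdbc:phoenix:172.31.25.244> SELECT * FROM cis_cust_imp_info;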

4. Import large amounts of CSV data in MapReduce mode

[root@ip-172-31-25-243 phoenix]# hadoop jar phoenix-4.3.0-clabs-phoenix-1.0.0-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table cis_cust_imp_info --input /root/ceb/cis_cust_imp_info.csv --zookeeper 172.31.25.244

This fails with the following error:

java.util.concurrent.ExecutionException: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:188)
    at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1795)
    at org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1751)
    at org.apache.phoenix.query.ConnectionQueryServicesImpl.metaDataCoprocessorExec(ConnectionQueryServicesImpl.java:1006)
    at org.apache.phoenix.query.ConnectionQueryServicesImpl.getTable(ConnectionQueryServicesImpl.java:1257)
    at org.apache.phoenix.schema.MetaDataClient.updateCache(MetaDataClient.java:348)
    at org.apache.phoenix.schema.MetaDataClient.updateCache(MetaDataClient.java:309)
    at org.apache.phoenix.schema.MetaDataClient.getCurrentTime(MetaDataClient.java:293)
    at org.apache.phoenix.compile.StatementContext.getCurrentTime(StatementContext.java:253)
    at org.apache.phoenix.execute.BaseQueryPlan.iterator(BaseQueryPlan.java:184)
    at org.apache.phoenix.execute.BaseQueryPlan.iterator(BaseQueryPlan.java:154)
    at org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:235)
    at org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:226)
    at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
    at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:225)
    at org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:1039)
    at org.apache.phoenix.jdbc.PhoenixDatabaseMetaData.getColumns(PhoenixDatabaseMetaData.java:492)
    at org.apache.phoenix.util.CSVCommonsLoader.generateColumnInfo(CSVCommonsLoader.java:296)
    at org.apache.phoenix.mapreduce.CsvBulkLoadTool.buildImportColumns(CsvBulkLoadTool.java:291)
    at org.apache.phoenix.mapreduce.CsvBulkLoadTool.loadData(CsvBulkLoadTool.java:200)
    at org.apache.phoenix.mapreduce.CsvBulkLoadTool.run(CsvBulkLoadTool.java:186)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.phoenix.mapreduce.CsvBulkLoadTool.main(CsvBulkLoadTool.java:97)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.IllegalAccessError: class com.google.protobuf.HBaseZeroCopyByteString cannot access its superclass com.google.protobuf.LiteralByteString

A web search shows this is a known HBase issue: the hbase-protocol jar has to be on the Hadoop classpath. The workaround is:

[root@ip-172-31-25-243 phoenix]# cd /opt/cloudera/parcels/CDH/lib/hadoop
ln -s /opt/cloudera/parcels/CDH-5.4.5-1.cdh5.4.5.p0.7/lib/hbase/lib/hbase-protocol-1.0.0-cdh5.4.5.jar hbase-protocol-1.0.0-cdh5.4.5.jar
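
If you prefer not to touch the parcel directory, an alternative (not tested here, shown only as a sketch) is to put the HBase jars, including hbase-protocol, on the job classpath via HADOOP_CLASSPATH when launching the tool:

HADOOP_CLASSPATH=$(hbase classpath) hadoop jar phoenix-4.3.0-clabs-phoenix-1.0.0-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table cis_cust_imp_info --input /root/ceb/cis_cust_imp_info.csv --zookeeper 172.31.25.244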

Re-running the import command produces the following error:

15/09/04 11:04:43 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)
    at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:238)

The cause is the permissions on the /user directory. Rerunning as the hdfs user still produced an error, so open /user up to 777 with chmod:

sudo -u hdfs hdfs dfs -chmod 777 /user
sudo -u hdfs hadoop jar phoenix-4.3.0-clabs-phoenix-1.0.0-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table cis_cust_imp_info --input /root/ceb/cis_cust_imp_info.csv --zookeeper 172.31.25.244

15/09/04 11:06:05 ERROR mapreduce.CsvBulkLoadTool: Import job on table=CIS_CUST_IMP_INFO failed due to exception:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://ip-172-31-25-243.us-west-2.compute.internal:8020/root/ceb/cis_cust_imp_info.csv
15/09/04 11:06:05 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x14f97b7df1400a4

It turns out that when running in MapReduce mode, the input file must be on HDFS.
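
The upload step is not shown in the original post; a minimal sketch that matches the HDFS path in the error message would be:

sudo -u hdfs hdfs dfs -mkdir -p /root/ceb
sudo -u hdfs hdfs dfs -chmod 777 /root/ceb
hdfs dfs -put /root/ceb/cis_cust_imp_info.csv /root/ceb/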

With the file on HDFS, the MapReduce job completes and the HFiles are generated successfully, but it gets stuck at the loadIncrementalHFiles step. The reason is that the HBase table being loaded is owned by hbase:hbase, while the generated HFiles are owned by the current user and group. So run the tool as the hbase user:

sudo -u hbase hadoop jar phoenix-4.3.0-clabs-phoenix-1.0.0-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table cis_cust_imp_info --input /root/ceb/cis_cust_imp_info.csv --zookeeper 172.31.25.244

Done!
