In the analysis that follows, I will first lay out a basic evolution: why Spark Thrift JDBCServer came into use, and roughly where it sits in the big-data stack. Once that is understood, using Spark Thrift JDBCServer should feel much more "natural". That said, if you have already been through MapReduce, Hive, Spark on Yarn, Spark on Yarn with Hive, Spark SQL and the like, the official documentation alone should make its positioning and the reason for its existence clear. So this part is really an introduction to the roles and evolution of the related big-data components, as a way of understanding where Spark Thrift JDBCServer fits.
Hive is known as SQL on Hadoop, or SQL on MapReduce: a data-warehouse framework built on top of Hadoop. Put simply, on Hive you can write SQL statements to analyze your data just as you would in an RDB; Hive's interpreter translates those SQL statements into MapReduce jobs and submits them to Yarn to run. As long as you can write SQL, you can build powerful distributed MapReduce applications.
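For instance (the table here is hypothetical; the point is that the statement is plain SQL while the execution is a distributed MapReduce job):
[code class="language-shell"]hive> -- Hive compiles this aggregation into a MapReduce job and submits it to Yarn
hive> select name, count(*) as cnt from access_log group by name;[/code]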
How do we avoid the awkwardness of plain Spark SQL described earlier? One branch of Spark SQL is Spark on Hive: it reuses Hive's HQL parsing, logical-plan translation, and plan optimization, and can be thought of, to a first approximation, as replacing only the physical execution, MR jobs, with Spark jobs. Integrating Spark SQL with Hive means reading table metadata from the Hive metastore and then operating on the data through Spark SQL.
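Concretely, the integration usually boils down to letting Spark read the same metastore configuration that Hive uses. A minimal sketch, with hypothetical paths (adjust to your own layout):
[code class="language-shell"]# Point Spark at Hive's metastore configuration (paths are assumptions)
$ cp ~/app/hive/conf/hive-site.xml ~/app2/spark/conf/
# Hive tables then become visible to Spark SQL, e.g. from the spark-sql shell
$ ~/app2/spark/bin/spark-sql -e "show tables"[/code]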
[code class="language-shell"]$ cd bin/
$ ./beeline[/code] Then connect to the Thrift server:
[code class="language-shell"]Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
2019-07-13 15:58:40 INFO Utils:310 - Supplied authorities: localhost:10000
2019-07-13 15:58:40 INFO Utils:397 - Resolved authority: localhost:10000
2019-07-13 15:58:40 INFO HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>[/code] After connecting, you can run all kinds of SQL statements. As an aside, beeline can also connect and execute a statement in one shot, which is handy for scripting; a quick sketch follows.
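A minimal non-interactive invocation (the query is only illustrative; -u and -e are standard beeline options):
[code class="language-shell"]$ ./beeline -u jdbc:hive2://localhost:10000 -e "show tables;"[/code]
Back in the interactive session, the usual SQL operations all work: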
[code class="language-shell"]0: jdbc:hive2://localhost:10000> create table person
0: jdbc:hive2://localhost:10000> (
0: jdbc:hive2://localhost:10000> id int,
0: jdbc:hive2://localhost:10000> name string
0: jdbc:hive2://localhost:10000> );
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.116 seconds)
0: jdbc:hive2://localhost:10000> insert into person values(1,'xpleaf');
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.664 seconds)
0: jdbc:hive2://localhost:10000> select * from person;
+-----+---------+--+
| id | name |
+-----+---------+--+
| 1 | xpleaf |
+-----+---------+--+
1 row selected (0.449 seconds)[/code] Now go back to the 4040 page mentioned earlier and take a look:
You can also inspect the metastore directly in MySQL:
[code class="language-shell"]mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A[/code]
For comparison, here is the same workflow in the Hive CLI, where the insert runs as a MapReduce job:
[code class="language-shell"]hive> show databases;
OK
default
Time taken: 0.937 seconds, Fetched: 1 row(s)
hive> show tables;
OK
Time taken: 0.059 seconds
hive> create table person
> (
> id int,
> name string
> );
OK
Time taken: 0.284 seconds
hive> insert into person values(1,'xpleaf');
...MapReduce job submission output omitted...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 4089 HDFS Write: 79 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 17.54 seconds
hive> select * from person;
OK
1 xpleaf
Time taken: 0.105 seconds, Fetched: 1 row(s)[/code]
[code class="language-shell"]Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
    at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
    at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
    ... 91 more[/code] In other words, the MySQL driver cannot be found. Copy the driver jar that Hive already uses into Spark's jars directory:
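A sketch of the copy, with an assumed jar name and assumed paths (use whichever MySQL connector your Hive ships):
[code class="language-shell"]# Hypothetical connector version/paths -- match them to your installation
$ cp ~/app/hive/lib/mysql-connector-java-5.1.39.jar ~/app2/spark/jars/[/code]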
After restarting the Thrift server, a different error shows up:
[code class="language-shell"]Caused by: MetaException(message:Hive Schema version 1.2.0 does not match metastore's schema version 2.1.0 Metastore is not upgraded or corrupt)[/code]
The cause is apparent from the Hive jars Spark ships in its jars directory:
[code class="language-shell"]~/app2/spark/jars$ ls hive-*
hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar[/code]
Spark bundles Hive 1.2.1 classes (expecting metastore schema 1.2.0), while this metastore was initialized by Hive 2.1.0, hence the mismatch.
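One common workaround, not necessarily the one used here, is to relax the schema check in the hive-site.xml that Spark reads; hive.metastore.schema.verification is a standard Hive property:
[code class="language-shell"]<!-- In spark/conf/hive-site.xml: skip the strict metastore schema version check -->
<property>
  <name>hive.metastore.schema.verification</name>
  <value>false</value>
</property>[/code]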
With the metastore issue resolved, restarting the Thrift server succeeds, and the log shows the service coming up on port 10000:
[code class="language-shell"]...
2019-07-13 17:16:47 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1774c4e2{/sqlserver/session,null,AVAILABLE,@Spark}
2019-07-13 17:16:47 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f0381f0{/sqlserver/session/json,null,AVAILABLE,@Spark}
2019-07-13 17:16:47 INFO ThriftCLIService:98 - Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads[/code]
Check the port number:
[code class="language-shell"]~/app2/spark/sbin$ lsof -i:10000
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 5122 yeyonghao 317u IPv6 0x3cb645c07a5bcbbb 0t0 TCP *:ndmp (LISTEN)[/code]
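The NODE name ndmp is simply what /etc/services maps port 10000 to; lsof's -P flag suppresses that conversion and prints the numeric port instead:
[code class="language-shell"]# -P: show port numbers rather than service names
$ lsof -P -i:10000[/code]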
3.2.3 Starting beeline for testing
Here we start beeline to run some operations:
[code class="language-shell"]~/app2/spark/bin$ ./beeline
Beeline version 1.2.1.spark2 by Apache Hive[/code] Earlier we created a person table in Hive. If the Hive integration succeeded, that table should be visible here as well, since both sides share the same metastore. Query its data as follows:
[code class="language-shell"]beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
2019-07-13 17:20:02 INFO Utils:310 - Supplied authorities: localhost:10000
2019-07-13 17:20:02 INFO Utils:397 - Resolved authority: localhost:10000
2019-07-13 17:20:02 INFO HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
+-----------+------------+--------------+--+
| database | tableName | isTemporary |
+-----------+------------+--------------+--+
| default | person | false |
+-----------+------------+--------------+--+
1 row selected (0.611 seconds)
0: jdbc:hive2://localhost:10000> select * from person;
+-----+---------+--+
| id | name |
+-----+---------+--+
| 1 | xpleaf |
+-----+---------+--+
1 row selected (1.842 seconds)[/code] As you can see, everything works. Check the 4040 page again:
Now let's create another table, person2:
[code class="language-shell"]0: jdbc:hive2://localhost:10000> create table person2
0: jdbc:hive2://localhost:10000> (
0: jdbc:hive2://localhost:10000> id int,
0: jdbc:hive2://localhost:10000> name string
0: jdbc:hive2://localhost:10000> );
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.548 seconds)[/code] At this point you can look inside the MySQL database that backs the metastore; the tables we created are recorded in its TBLS table:
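An illustrative query (the column names come from the standard Hive metastore schema; output omitted):
[code class="language-shell"]mysql> select TBL_ID, DB_ID, TBL_NAME, TBL_TYPE from TBLS;[/code]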
The official documentation describes it this way: "This script accepts all bin/spark-submit command line options, plus a --hiveconf option to specify Hive properties. You may run ./sbin/start-thriftserver.sh --help for a complete list of all available options. By default, the server listens on localhost:10000. You may override this behaviour via either environment variables, i.e.:"
……
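For reference, the example that follows in the Spark documentation overrides the listening host and port via environment variables:
[code class="language-shell"]export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  ...[/code]
Here, instead, we try pointing the server at Yarn: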
[code class="language-shell"]~/app2/spark/sbin$ ./start-thriftserver.sh --master yarn
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/yeyonghao/app2/spark/logs/spark-yeyonghao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-yeyonghaodeMacBook-Pro.local.out
failed to launch: nice -n 0 bash /Users/yeyonghao/app2/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --master yarn
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/bin/java -cp /Users/yeyonghao/app2/spark/conf/:/Users/yeyonghao/app2/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server spark-internal
========================================
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
full log in /Users/yeyonghao/app2/spark/logs/spark-yeyonghao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-yeyonghaodeMacBook-Pro.local.out[/code] The key part of the error message is:
[code class="language-shell"]When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.[/code] Add the variable directly in spark-env.sh:
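Presumably along these lines (the path is an assumption; point it at your own Hadoop installation's configuration directory):
[code class="language-shell"]# Hypothetical path -- substitute the conf directory of your Hadoop installation
export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop[/code]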