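Hive Functions

Categories of Hive functions

As in Java or MySQL, Hive provides functions, and they fall into three categories.

UDF (User-Defined Function)

One row in, one value out.
Example: sin(30°) = 1/2

UDAF (User-Defined Aggregate Function)

Many rows in, one value out.
Examples: max, min, count, sum, avg, and so on.

UDTF (User-Defined Table-Generating Function)

One row in, many rows out.
Example: explode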
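
To make the three input/output shapes concrete, here is a minimal plain-Java analogy. It does not use any Hive API; the class and method names (FunctionShapes, sinDegrees, maxOf, explode) are made up for illustration only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java analogy only; none of these names are Hive APIs.
class FunctionShapes {

    // UDF shape: one input value, one output value.
    static double sinDegrees(double degrees) {
        return Math.sin(Math.toRadians(degrees));
    }

    // UDAF shape: many input rows, one output value.
    static int maxOf(List<Integer> rows) {
        return Collections.max(rows);
    }

    // UDTF shape: one input value (an array), many output rows.
    static List<Integer> explode(Integer[] array) {
        return Arrays.asList(array);
    }

    public static void main(String[] args) {
        System.out.println(sinDegrees(30));               // approximately 0.5
        System.out.println(maxOf(Arrays.asList(3, 9, 4)));
        for (Integer row : explode(new Integer[] {1, 2, 3, 4})) {
            System.out.println(row);                      // one line per element
        }
    }
}
```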

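Common functions

show functions;             list the functions available in Hive
desc function func_name;    show the help text for a function

case when   ----> like switch or if/else
if          ----> like the ternary operator
explode     ----> turns the elements of an array into multiple rows
a = [1, 2, 3, 4] explode(a) ===>
        1
        2
        3
        4
split       ----> the same as split on strings: splits a string into an array

array       ----> builds an array from its arguments
collect_set ----> aggregates a column's values into an array, dropping duplicates
collect_list ----> aggregates a column's values into an array, keeping duplicates
concat_ws   ----> joins elements with the given separator
--------------
row_number  ----> numbers the rows within each group; used for per-group (secondary) sorting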
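
The difference between collect_list, collect_set and concat_ws is easiest to see side by side. The sketch below imitates them in plain Java; CollectSketch and its methods are hypothetical names, and the first-seen ordering of the set is an assumption of this sketch (Hive does not promise any particular order for collect_set).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java imitation of three Hive functions; not Hive code.
class CollectSketch {

    // Like collect_list: keep every value, duplicates included.
    static List<String> collectList(List<String> column) {
        return new ArrayList<>(column);
    }

    // Like collect_set: keep each distinct value once.
    static Set<String> collectSet(List<String> column) {
        return new LinkedHashSet<>(column);
    }

    // Like concat_ws: join the elements with a separator.
    static String concatWs(String separator, Iterable<String> elements) {
        return String.join(separator, elements);
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("a", "b", "a");
        System.out.println(concatWs(",", collectList(column)));
        System.out.println(concatWs(",", collectSet(column)));
    }
}
```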

Function examples

wordcount

Analysis of the input:
    hello   you
    hello   me
    hello   he
With MapReduce
    step 1 -----> split("\t") --->
        ["hello", "you"]
        ["hello", "me"]
        ["hello", "he"]
    step 2 -----> iterate over every array and emit each element as a key with the value 1: <key, 1>
        <"hello", 1>
        <"you", 1>
        <"hello", 1>
        <"me", 1>
        <"hello", 1>
        <"he", 1>

    step 3, shuffle --->
        <"hello", [1, 1, 1]>
        <"you", [1]>
        <"me", [1]>
        <"he", [1]>
    step 4, reduce ====> sum the values of each key (like reduceByKey)
With HQL
    step 1 (mydb1)> select split(line, "\t") from test;
            ["hello","you"]
            ["hello","he"]
            ["hello","me"]
    step 2 turn every element of each array into its own row
            (mydb1)> select explode(split(line, "\t"))  from test;
                hello
                you
                hello
                he
                hello
                me
    step 3 run a group by on top of the step 2 result
        select
            w.word, count(w.word) as count
        from (select explode(split(line, "\t")) word  from test) w
        group by w.word order by count desc;
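
For comparison, the same split, explode and group-by pipeline can be sketched in a few lines of plain Java. This is only an in-memory illustration, not MapReduce or Hive code; WordCountSketch is a hypothetical name and the tab separator follows the sample data above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory word count that mirrors the three HQL steps.
class WordCountSketch {

    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            // steps 1 and 2: split the line, then visit each element as its own "row"
            for (String word : line.split("\t")) {
                // step 3: group by word and count
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello\tyou", "hello\tme", "hello\the");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```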

case when

Use case when to display the department name that corresponds to each id:

1 ---> 学工组 (student affairs), 2 ---> 行政组 (administration), 3 ---> 销售组 (sales), 4 ---> 研发组 (R&D), 5 ---> 其它 (other)
hive (mydb1)> select * from t1;
1
2
3
4
5
select
  id,
case id
  when 1 then "学工组"
  when 2 then "行政组"
  when 3 then "销售组"
  when 4 then "研发组"
  else "其它"
end
from t1;
Result:
1   学工组
2   行政组
3   销售组
4   研发组
5   其它
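
Since case when is described above as the counterpart of switch, and if as the counterpart of the ternary operator, a small Java sketch may help. DeptNameSketch is a hypothetical name, the labels are English glosses of the department names, and the gender mapping is just an illustrative assumption.

```java
// Java counterparts of Hive's "case ... when" and "if(cond, a, b)".
class DeptNameSketch {

    // Mirrors: case id when 1 then ... when 4 then ... else ... end
    static String deptName(int id) {
        switch (id) {
            case 1: return "student affairs";
            case 2: return "administration";
            case 3: return "sales";
            case 4: return "R&D";
            default: return "other";
        }
    }

    // Mirrors: if(sex == 0, 'female', 'male')
    static String gender(int sex) {
        return sex == 0 ? "female" : "male";
    }

    public static void main(String[] args) {
        for (int id = 1; id <= 5; id++) {
            System.out.println(id + "\t" + deptName(id));
        }
    }
}
```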

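row_number: secondary sort

Three kinds of join
    Cross join
        cross join; produces a Cartesian product, so avoid it
    Inner join (equi-join)
        inner join
        outputs only the rows of the left and right tables that match
    Outer join
        outer join
        left outer join

        right outer join

Given three tables (employee, department, salary):
    1. List the employees of each department (showing department name, employee name, gender [男 (male) | 女 (female)] and salary), sorted by salary in descending order within each department
        select
           e.name, if(sex == 0, '女', '男') as gender, d.name, s.salary,
           row_number() over(partition by e.deptid order by s.salary desc) rank
        from t_dept d
        left join t_employee e on d.id = e.deptid
        left join t_salary s on e.id = s.empid
        where s.salary is not null;
    2. Get the top 2 earners of each department
        select
           tmp.*
        from
        (select
           e.name as ename, if(sex == 0, '女', '男') as gender, d.name as dname, s.salary,
           row_number() over(partition by e.deptid order by s.salary desc) rank
        from t_dept d
        left join t_employee e on d.id = e.deptid
        left join t_salary s on e.id = s.empid
        where s.salary is not null) tmp
        where tmp.rank < 3;
        e.name and d.name get distinct aliases here because a subquery cannot expose two columns with the same name.
        The original note suggests that a single-table query can skip the subquery and filter with having (having rank < 3). Window functions are normally evaluated after having, so check this on your Hive version before relying on it; the subquery form is the safer choice.

The following example shows how row_number is used: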

hive (mydb2)> create table t9(
            >   id int,
            >   province string,
            >   salary float
            > );
hive (mydb2)> insert into t9 values(1,'gd',18000),(2,'gd',16000),(3,'bj',13000),(4,'gd',15000),(5,'bj',17000),(6,'bj',19000);
hive (mydb2)> select * from t9;
OK
1       gd      18000.0
2       gd      16000.0
3       bj      13000.0
4       gd      15000.0
5       bj      17000.0
6       bj      19000.0
Time taken: 0.097 seconds, Fetched: 6 row(s)
hive (mydb2)> select
            >   id,
            >   province,
            >   salary,
            >   row_number() over(partition by province order by salary desc) as rank
            > from t9;
OK
6       bj      19000.0 1
5       bj      17000.0 2
3       bj      13000.0 3
1       gd      18000.0 1
2       gd      16000.0 2
4       gd      15000.0 3
Time taken: 1.578 seconds, Fetched: 6 row(s)
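
What row_number() over(partition by province order by salary desc) computes can also be reproduced by hand. The sketch below does so in plain Java on the same six rows as t9: sort by province, then by salary descending, and restart the counter for each province. RowNumberSketch, Row, sample and topN are hypothetical names; this illustrates the semantics and is not how Hive implements windowing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RowNumberSketch {

    static final class Row {
        final int id;
        final String province;
        final double salary;
        int rank; // filled in by rowNumber()

        Row(int id, String province, double salary) {
            this.id = id;
            this.province = province;
            this.salary = salary;
        }
    }

    // The same rows that were inserted into t9.
    static List<Row> sample() {
        return Arrays.asList(
                new Row(1, "gd", 18000), new Row(2, "gd", 16000), new Row(3, "bj", 13000),
                new Row(4, "gd", 15000), new Row(5, "bj", 17000), new Row(6, "bj", 19000));
    }

    // partition by province, order by salary desc, number from 1 within each partition
    static List<Row> rowNumber(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((Row r) -> r.province)
                .thenComparing(Comparator.comparingDouble((Row r) -> r.salary).reversed()));
        Map<String, Integer> counters = new HashMap<>();
        for (Row r : sorted) {
            r.rank = counters.merge(r.province, 1, Integer::sum);
        }
        return sorted;
    }

    // Equivalent of the outer "where rank <= n" filter.
    static List<Row> topN(List<Row> ranked, int n) {
        List<Row> kept = new ArrayList<>();
        for (Row r : ranked) {
            if (r.rank <= n) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (Row r : rowNumber(sample())) {
            System.out.println(r.id + "\t" + r.province + "\t" + r.salary + "\t" + r.rank);
        }
    }
}
```

Calling topN(rowNumber(sample()), 2) keeps the two best-paid rows of each province, which corresponds to the "top 2 per department" query in the previous section.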

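Custom Hive functions

Steps for writing a custom function

Writing a custom function takes six steps:

1°. Write a Java class that extends the UDF class
2°. Implement its evaluate() method, which Hive calls for you
3°. Package the code as a jar and upload it to the server
4°. Add the jar from step 3° to Hive's classpath
In the hive terminal, run: add jar jar_path;
5°. Give the custom function a temporary name, that is, create a temporary function
create temporary function function_name as 'fully qualified name of the class that contains evaluate()';
6°. When you are done, you can drop the temporary function by hand or simply leave it, because it is destroyed automatically when the session ends

UDF example: derive the Chinese zodiac sign (生肖) and the constellation from a user's birthday

The code is as follows: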

package com.uplooking.bigdata.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import java.util.Calendar;
import java.util.Date;

@Description(name = "z_c",
        value = "_FUNC_(param1, param2) - returns the Chinese zodiac sign or the constellation for the given date",
        extended = "param1 and param2 can be the following:\n"
                + "1. param1 is a string in the format of 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.\n"
                + "2. param1 date value\n"
                + "3. param1 timestamp value\n"
                + "4. param2 0 or 1, 0 means constellation, 1 means zodiac\n"
                + "Example:\n "
                + "  > SELECT _FUNC_('2009-07-30', 0) FROM src LIMIT 1;\n" + "  狮子座")
public class ZodicaAndConstellationUDF extends UDF {

    public Text evaluate(java.sql.Date date, int type) {
        if (date == null) { // a NULL birthday yields NULL
            return null;
        }
        if(type == 0) {// constellation
            return new Text(getConstellation(new Date(date.getTime())));
        } else if(type == 1) { // Chinese zodiac sign
            return new Text(getZodica(new Date(date.getTime())));
        }
        return null;
    }

    // Zodiac signs indexed by (year % 12); index 0 is the monkey
    public String[] zodiacArr = { "猴", "鸡", "狗", "猪", "鼠", "牛", "虎", "兔", "龙", "蛇", "马", "羊" };
    // Constellations indexed by the 0-based month in which they begin; index 0 is Aquarius
    public String[] constellationArr = { "水瓶座", "双鱼座", "白羊座", "金牛座", "双子座", "巨蟹座", "狮子座", "处女座", "天秤座", "天蝎座", "射手座", "魔羯座" };
    // Day of each month on which the next constellation begins
    public int[] constellationEdgeDay = { 20, 19, 21, 21, 21, 22, 23, 23, 23, 23, 22, 22 };
    /**
     * Returns the Chinese zodiac sign for the year of the given date.
     * @return the zodiac sign
     */
    public String getZodica(Date date) {
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        return zodiacArr[cal.get(Calendar.YEAR) % 12];
    }
    /**
     * Returns the constellation for the given date.
     * @return the constellation, or an empty string if the date is null
     */
    public String getConstellation(Date date) {
        if (date == null) {
            return "";
        }
        Calendar cal = Calendar.getInstance();
        cal.setTime(date);
        int month = cal.get(Calendar.MONTH);
        int day = cal.get(Calendar.DAY_OF_MONTH);
        // Before the edge day the date still belongs to the previous constellation
        if (day < constellationEdgeDay[month]) {
            month = month - 1;
        }
        if (month >= 0) {
            return constellationArr[month];
        }
        // Early January wraps around to 魔羯座 (Capricorn)
        return constellationArr[11];
    }
}

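Note: the Maven dependencies are listed at the end of these notes.

After uploading the jar to the server, add it to Hive's classpath from the hive terminal:

add jar /home/uplooking/jars/hive/udf-zc.jar;

Register the custom function:

create temporary function zc as 'com.uplooking.bigdata.hive.udf.ZodicaAndConstellationUDF';

Create a temporary table for testing:

hive (mydb1)>
            > create temporary table tmp(
            > birthday date);

Insert some test data:

hive (mydb1)> insert into tmp values('1994-06-21');

Use the function in a query: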

hive (mydb1)> select zc(birthday,0) from tmp;
OK
c0
双子座
Time taken: 0.084 seconds, Fetched: 1 row(s)
hive (mydb1)> select zc(birthday,1) from tmp;
OK
c0
狗
Time taken: 0.044 seconds, Fetched: 1 row(s)
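
The two results above can be checked without a Hive cluster. The sketch below repeats the UDF's lookup logic in plain Java, using java.time.LocalDate instead of Calendar and English glosses instead of the Chinese names; ZodiacSketch and its members are hypothetical names. Like the UDF, it uses the Gregorian year and ignores the lunar new year boundary.

```java
import java.time.LocalDate;

// Stand-alone copy of the UDF's lookup logic, for checking results locally.
class ZodiacSketch {

    // Indexed by (year % 12); index 0 is the monkey.
    static final String[] ZODIAC = {
        "Monkey", "Rooster", "Dog", "Pig", "Rat", "Ox",
        "Tiger", "Rabbit", "Dragon", "Snake", "Horse", "Goat" };

    // Indexed by the 0-based month in which the constellation begins.
    static final String[] CONSTELLATION = {
        "Aquarius", "Pisces", "Aries", "Taurus", "Gemini", "Cancer",
        "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn" };

    // Day of each month on which the next constellation begins.
    static final int[] EDGE_DAY = { 20, 19, 21, 21, 21, 22, 23, 23, 23, 23, 22, 22 };

    static String zodiac(LocalDate date) {
        return ZODIAC[date.getYear() % 12];
    }

    static String constellation(LocalDate date) {
        int month = date.getMonthValue() - 1; // LocalDate months are 1-based
        if (date.getDayOfMonth() < EDGE_DAY[month]) {
            month--; // still in the previous constellation
        }
        return CONSTELLATION[month >= 0 ? month : 11]; // early January wraps to Capricorn
    }

    public static void main(String[] args) {
        LocalDate birthday = LocalDate.of(1994, 6, 21);
        System.out.println(constellation(birthday)); // Gemini (双子座)
        System.out.println(zodiac(birthday));        // Dog (狗)
    }
}
```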

Here is an even simpler UDF that you can use for testing:

package cn.xpleaf.hive.udf;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * @author Leaf
 * @date 2018/9/18 11:11 PM
 */
@Description(name = "addUDF", value = "_FUNC_(num1, num2) - returns the sum of the two given numbers")
public class AddUDF extends UDF {

    // Returns the sum as text
    public Text evaluate(int num1, int num2) {
        return new Text(String.valueOf(num1 + num2));
    }

}

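Hive JDBC

Besides the CLI used so far, Hive also offers a JDBC interface. To use it, the hiveserver2 service must be started first. Once it is running, you can keep working from a command line through beeline, which ships with Hive (note that beeline talks to Hive over the JDBC interface), or you can write Java code against the interface yourself.

Starting the hiveserver2 service

[uplooking@uplooking01 ~]$ hiveserver2

Connecting to hiveserver2 with beeline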

[uplooking@uplooking01 hive]$ beeline
which: no hbase in (/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/jdk/bin:/home/uplooking/bin:/home/uplooking/app/zookeeper/bin:/home/uplooking/app/hadoop/bin:/home/uplooking/app/hadoop/sbin:/home/uplooking/app/hive/bin)
ls: cannot access /home/uplooking/app/hive/lib/hive-jdbc-*-standalone.jar: No such file or directory
Beeline version 2.1.0 by Apache Hive
beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Error: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: uplooking is not allowed to impersonate uplooking (state=,code=0)

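The connection fails. The fix is as follows:

The JDBC client cannot open a session on the remote Hive ThriftServer (hiveserver2) service.
The error says that uplooking is not allowed to impersonate uplooking.
    Newer versions enforce a security policy on user impersonation, so the uplooking user has to be configured by hand:
the Hadoop uplooking user and the Hive uplooking user must be linked through Hadoop's proxy-user settings.
Add the following properties to $HADOOP_HOME/etc/hadoop/core-site.xml:
    <property>
        <name>hadoop.proxyuser.uplooking.hosts</name>
        <value>*</value>
        <description>Hosts from which the uplooking user may act as a proxy</description>
    </property>
    <property>
        <name>hadoop.proxyuser.uplooking.groups</name>
        <value>root</value>
        <description>Groups whose members the uplooking user may impersonate</description>
    </property>     
After changing the configuration, sync it to every node of the cluster.
For the cluster to reload the configuration, at least HDFS has to be restarted.

After that, beeline can operate Hive through the JDBC interface as expected: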

beeline> !connect jdbc:hive2://uplooking01:10000/mydb1
Connecting to jdbc:hive2://uplooking01:10000/mydb1
Enter username for jdbc:hive2://uplooking01:10000/mydb1: uplooking
Enter password for jdbc:hive2://uplooking01:10000/mydb1: *********
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/uplooking/app/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/uplooking/app/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connected to: Apache Hive (version 2.1.0)
Driver: Hive JDBC (version 2.1.0)
18/03/23 08:00:15 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://uplooking01:10000/mydb1> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| mydb1          |
+----------------+--+
2 rows selected (2.164 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| t1        |
| t2        |
+-----------+--+
2 rows selected (0.118 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1> select * from t1;
+------------+--+
|  t1.line   |
+------------+--+
| hello you  |
| hello he   |
| hello me   |
+------------+--+
3 rows selected (2.143 seconds)
0: jdbc:hive2://uplooking01:10000/mydb1>

Connecting to hiveserver2 from Java code

The code is as follows:

package com.uplooking.bigdata.hive.jdbc;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveJDBC {
    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://uplooking01:10000/mydb1";
        // Word count over t1: split each line on spaces, explode into rows, then group
        String sql = "select t.word,count(t.word) as count from (select explode(split(line, ' ')) as word from t1) t group by t.word";
        // try-with-resources closes the result set, statement and connection even if the query fails
        try (Connection conn = DriverManager.getConnection(url, "uplooking", "uplooking");
             PreparedStatement ps = conn.prepareStatement(sql);
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                String word = rs.getString("word");
                int count = rs.getInt("count");
                System.out.println(word + "\t" + count);
            }
        }
    }
}

The program prints:

18/03/23 00:48:16 INFO jdbc.Utils: Supplied authorities: uplooking01:10000
18/03/23 00:48:16 INFO jdbc.Utils: Resolved authority: uplooking01:10000
he  1
hello   3
me  1
you 1

While the program runs, watch the output in the hiveserver2 terminal:

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = uplooking_20180323084825_63044683-393d-4625-a3c3-b440109c3d70
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1521765850571_0002, Tracking URL = http://uplooking02:8088/proxy/application_1521765850571_0002/
Kill Command = /home/uplooking/app/hadoop/bin/hadoop job  -kill job_1521765850571_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-03-23 08:48:33,427 Stage-1 map = 0%,  reduce = 0%
2018-03-23 08:48:40,864 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.54 sec
2018-03-23 08:48:48,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.84 sec
MapReduce Total cumulative CPU time: 6 seconds 840 msec
Ended Job = job_1521765850571_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.84 sec   HDFS Read: 8870 HDFS Write: 159 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 840 msec
OK

Fixing garbled Chinese comments in Hive

If Chinese comments show up garbled, try the following:

    Run the following statements in Hive's metastore database:
        ALTER TABLE COLUMNS_V2 MODIFY COLUMN COMMENT VARCHAR(256) CHARACTER SET utf8;
        ALTER TABLE TABLE_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE PARTITION_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE PARTITION_KEYS MODIFY COLUMN PKEY_COMMENT VARCHAR(4000) CHARACTER SET utf8;
        ALTER TABLE INDEX_PARAMS MODIFY COLUMN PARAM_VALUE VARCHAR(4000) CHARACTER SET utf8;
    Also add the UTF-8 parameters to the metastore connection URL:
        &useUnicode=true&characterEncoding=UTF-8
    In hive-site.xml each & has to be written as &amp;, because a bare & is not valid XML:
          <property>
            <name>javax.jdo.option.ConnectionURL</name>
            <value>jdbc:mysql://uplooking01:3306/hive?createDatabaseIfNotExist=true&amp;useUnicode=true&amp;characterEncoding=UTF-8</value>
          </property>

Maven dependencies for Hive

<properties>
   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
   <hive-api.version>2.1.0</hive-api.version>
   <hadoop-api.version>2.6.4</hadoop-api.version>
   <hadoop-core.version>1.2.1</hadoop-core.version>
</properties>

<dependencies>
<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.12</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce-client-core</artifactId>
  <version>${hadoop-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>${hadoop-core.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-serde</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-service</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-metastore</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-common</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-cli</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>${hive-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.thrift</groupId>
  <artifactId>libfb303</artifactId>
  <version>0.9.0</version>
</dependency>
</dependencies>
