Hive Functions & Compression

Big Data

Published: 2019-02-28


1. Sorting

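Order By: global sort
1) Sort the employee table by bonus (comm) in ascending order
select * from emptable order by emptable.comm asc;
asc is the default, so it can be omitted.

2) Sort the employee table by bonus (comm) in descending order
select * from emptable order by emptable.comm desc;

3) Sort by department and then by bonus, both ascending
select * from emptable order by deptno,comm;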

Sort By: per-reducer sort (each reducer's output is ordered, but the overall result is not)
Set the number of reducers: set mapreduce.job.reduces = 3;
select * from dept_partitions sort by deptno desc;
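To make "ordered within each reducer, unordered overall" concrete, here is a minimal plain-Java sketch that runs without Hive. The class name SortByDemo and the round-robin row assignment are assumptions made for illustration; Hive spreads rows across reducers in its own way when no Distribute By is given. Only the "each reducer sorts its own rows" part mirrors Sort By.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SortByDemo {
    // Deals rows out round-robin (an illustrative assumption, not Hive's
    // actual assignment), then sorts each reducer's rows in descending order.
    static List<List<Integer>> sortByDesc(List<Integer> deptnos, int reducers) {
        List<List<Integer>> outputs = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            outputs.add(new ArrayList<>());
        }
        for (int i = 0; i < deptnos.size(); i++) {
            outputs.get(i % reducers).add(deptnos.get(i));
        }
        for (List<Integer> out : outputs) {
            out.sort(Comparator.reverseOrder());
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Integer> deptnos = Arrays.asList(10, 40, 20, 50, 30, 60);
        // Each inner list is sorted, but reading them one after another is not.
        System.out.println(sortByDesc(deptnos, 3));
    }
}
```

With three reducers each output list is sorted on its own, yet concatenating the lists does not give a globally sorted result. That is why Order By is still needed when a total order matters.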

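Distribute By: partitions rows across reducers
1) Distribute rows by department number, then sort by location in descending order within each reducer.
select * from dept_partitions distribute by deptno sort by loc desc;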

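Cluster By: distribute and sort on the same column
1) Distribute and sort by department number
select * from dept_partitions cluster by deptno;

Note: when Distribute By and Sort By use the same column, Cluster By can replace them. Cluster By only sorts in ascending order; asc/desc cannot be specified.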
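The relationship between Distribute By, Sort By and Cluster By can be sketched in plain Java. Everything here is illustrative: the class DistributeByDemo, the {deptno, loc} row layout and the hash-modulo partitioner are assumptions, not Hive's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class DistributeByDemo {
    // Each row is {deptno, loc}; columns are addressed by index.
    static List<List<int[]>> distributeBySortBy(
            List<int[]> rows, int reducers, int distributeCol, int sortCol, boolean descending) {
        List<List<int[]>> outputs = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            outputs.add(new ArrayList<>());
        }
        // Assumed partitioner: non-negative hash of the key modulo the reducer count,
        // so rows with the same key always land on the same reducer.
        for (int[] row : rows) {
            int reducer = (Integer.hashCode(row[distributeCol]) & Integer.MAX_VALUE) % reducers;
            outputs.get(reducer).add(row);
        }
        Comparator<int[]> order = Comparator.comparingInt(r -> r[sortCol]);
        if (descending) {
            order = order.reversed();
        }
        for (List<int[]> out : outputs) {
            out.sort(order);
        }
        return outputs;
    }

    // Cluster By behaves like Distribute By plus an ascending Sort By on the same column.
    static List<List<int[]>> clusterBy(List<int[]> rows, int reducers, int col) {
        return distributeBySortBy(rows, reducers, col, col, false);
    }

    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(
                new int[]{10, 1700}, new int[]{20, 1800}, new int[]{30, 1900},
                new int[]{10, 1800}, new int[]{40, 1700});
        for (List<int[]> out : distributeBySortBy(rows, 3, 0, 1, true)) {
            StringBuilder line = new StringBuilder();
            for (int[] row : out) {
                line.append(Arrays.toString(row)).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

In this sketch rows with the same deptno always share a reducer, and the sort only orders rows inside that reducer.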

2. Bucketing

Bucketing splits a table's data into separate files
1) Create a bucketed table
The key clause is: clustered by(id) into 4 buckets

hive> set mapreduce.job.reduces=4;
hive> create table emptable_buck(id int, name string)
    > clustered by(id) into 4 buckets
    > row format
    > delimited fields
    > terminated by '\t';
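As a rough illustration of how rows could end up in 4 bucket files, the sketch below assigns ids to buckets by hash modulo the bucket count. It assumes the classic scheme in which an int column hashes to its own value; newer Hive versions may use a different hash function, and the class BucketDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BucketDemo {
    // Assumed rule: (hash & Integer.MAX_VALUE) % numBuckets, where an int's
    // hash is taken to be the int itself.
    static int bucketFor(int id, int numBuckets) {
        return (id & Integer.MAX_VALUE) % numBuckets;
    }

    // Groups ids by bucket index; each entry stands for one bucket file.
    static Map<Integer, List<Integer>> assign(List<Integer> ids, int numBuckets) {
        Map<Integer, List<Integer>> buckets = new TreeMap<>();
        for (int id : ids) {
            buckets.computeIfAbsent(bucketFor(id, numBuckets), k -> new ArrayList<>()).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(assign(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4));
    }
}
```

Under these assumptions ids 4 and 8 share bucket 0, ids 1 and 5 share bucket 1, and so on.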

View the table description

hive> desc formatted emptable_buck;

Load data directly into the bucketed table

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_buck;

Create a plain (non-bucketed) staging table

hive> create table emptable_b(id int, name string)
    > row format
    > delimited fields
    > terminated by '\t';

Truncate the bucketed table

hive> truncate table emptable_buck;

Load the data into the staging table (it will be bucketed on insert)

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_b;

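Enable bucketing enforcement (rows are bucketed on insert; if it is not enabled, everything lands in a single bucket)

hive> set hive.enforce.bucketing=true;
hive> truncate table emptable_buck;

Then populate the bucketed table from the staging table with an insert ... select, for example:

hive> insert into table emptable_buck select id, name from emptable_b;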

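When a representative result is enough and the full result is not needed, sample the table:
tablesample(bucket x out of y on id), here (bucket 1 out of 2 on id)
x = 1: start from the first bucket
y = 2: the table has 4 buckets, so 4/2 = 2 buckets are sampled (the 1st and the 3rd)

hive> select * from emptable_buck tablesample(bucket 1 out of 2 on id);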
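The effect of bucket x out of y can be approximated in plain Java: keep the rows whose hash falls into bucket x when the data is divided into y buckets. As before, this assumes an int id hashes to its own value, which may not hold for newer Hive versions, and the class SampleDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SampleDemo {
    // Assumed bucket rule: non-negative hash modulo the bucket count.
    static int bucketFor(int id, int numBuckets) {
        return (id & Integer.MAX_VALUE) % numBuckets;
    }

    // Mimics tablesample(bucket x out of y on id); x is 1-based, as in HiveQL.
    static List<Integer> sample(List<Integer> ids, int x, int y) {
        List<Integer> picked = new ArrayList<>();
        for (int id : ids) {
            if (bucketFor(id, y) == x - 1) {
                picked.add(id);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        System.out.println(sample(ids, 1, 2));
    }
}
```

For ids 1 to 8 this picks the even ids. In a 4-bucket table built with the same rule, those rows sit in the 1st and 3rd buckets, which is half of the table.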

3. User-Defined Functions (UDF)

List the built-in functions
show functions;

Show the detailed description of a function
desc function extended upper;

UDF: one row in, one value out
UDAF: aggregate function, many rows in, one value out, e.g. count/max/avg
UDTF: one row in, many rows out

Java
Add all the jars under Hive's lib directory to the project
Write the Java code

package com.hsiehchou;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyConcat extends UDF {
    // Joins the two arguments with "******" as the separator.
    // Hive finds this method by its name, evaluate.
    public String evaluate(String a, String b) {
        return a + "******" + String.valueOf(b);
    }
}
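Before packaging the jar, the concatenation logic can be checked without Hive on the classpath. The sketch below copies the body of evaluate into a hypothetical class MyConcatCheck. It also shows that a null argument becomes the text "null", and offers one possible null-safe variant (a design choice of this sketch, not part of the original UDF).

```java
final class MyConcatCheck {
    // Same expression as MyConcat.evaluate, minus the Hive dependency.
    static String concat(String a, String b) {
        return a + "******" + String.valueOf(b);
    }

    // Optional variant: return null when either input is null, as many SQL functions do.
    static String concatNullSafe(String a, String b) {
        if (a == null || b == null) {
            return null;
        }
        return a + "******" + b;
    }

    public static void main(String[] args) {
        System.out.println(concat("SMITH", "CLERK"));
        System.out.println(concat(null, "CLERK"));
        System.out.println(concatNullSafe(null, "CLERK"));
    }
}
```

Whether a null input should yield "null******CLERK" or a null result depends on how the function is meant to be used in queries.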

Export this class, package it as a jar, and copy the jar to hsiehchou121

Register it as a temporary function:
add jar /root/Myconcat.jar;
create temporary function my_cat as 'com.hsiehchou.MyConcat';

<!-- Permanent registration: hive-site.xml -->
<property>
<name>hive.aux.jars.path</name>
<value>file:///root/hd/hive/lib/hive.jar</value>
</property>

4. Hive Compression

Storage: HDFS
Compute: MapReduce

Compression at the map output stage
Enable compression of Hive's intermediate data
set hive.exec.compress.intermediate=true;

Enable map output compression
set mapreduce.map.output.compress=true;

Use the Snappy codec
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Compression at the reduce output stage
Enable compression of Hive's final output
set hive.exec.compress.output=true;

Enable compression of the MapReduce job output
set mapreduce.output.fileoutputformat.compress=true;

Specify the compression codec
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Set the compression type to block compression
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test the result
insert overwrite local directory '/root/datas/rs' select * from emptable order by sal desc;

谢舟

https://blog.hsiehchou.com/2019/02/28/hive-han-shu-ya-suo/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit 谢舟 as the source when reposting.
