Hive QL查询

使用HiveQL查询映射到Hive表的数据集。 Hive中的SELECT基础语法和标准SQL语法基本一致,支持WHERE、DISTINCT、GROUP BY、ORDER BY、HAVING、LIMIT、子查询等。

HiveQL查询语法如下:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]    

select...from语句

select是SQL的投影算子,from子句标识了从哪个表、视图或嵌套查询中选择记录。

例:执行简单的查询:

select * from employees;   

说明:

在Hive中查询语句往往都要被解析成MapReduce的job进行计算,但是有两个查询语句是不走MapReduce的:

  • 1) select * from employees;
  • 2) select * from employees limit 2;

注:在MYSQL中limit是取前几条记录,但是在Hive中,limit是抽样,会随机返回对应的记录数。

另外,“SELECT *”扫描整个表/文件,不会触发MapReduce jobs,因此它比“select 列名”要运行得快。

请参考Hive简单查询_以物价检测数据为例(视频教程)

一、数据准备

1、启动HDFS和YARN集群:

$ start-dfs.sh
$ start-yarn.sh

2、上传特价检测的数据文件pricewatch.csv:

$ hdfs dfs -mkdir -p /hive_data/p
$ hdfs dfs -put pricewatch.csv /hive_data/p/ 

启动Hive CLI:

$ hive 

创建一个Hive外部表pricewatch,并加载数据文件:

-- 创建数据库
create database xueai8;

-- 查看当前有哪些数据库
show databases;

-- 使用数据库
use xueai8;

-- 创建外部表,并忽略第一行(标题行)
create external table pricewatch(
        recordid string,
        jclb string,
        jcmc string,
        bqjg decimal(10,3),
        sqjg decimal(10,3),
        tb decimal(10,3),
        hb decimal(10,3)
) 
row format delimited 
fields terminated by ',' 
location '/hive_data/p'
TBLPROPERTIES ('serialization.null.format' = '','skip.header.line.count' = '1');

show tables;

-- 查看完整建表语句
show create table pricewatch; 

二、执行简单查询

下面是对pricewatch执行简单条件查询的示例。

-- limit:限制返回记录的数量
-- 查询前10条数据:
select * from pricewatch limit 10;
select * from pricewatch limit 10,5;

-- 条件查询
-- 单条件
-- 查询出所有本期价格高于5.00元的商品
select * from pricewatch where bqjg > 5.00;
-- 多条件
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00;
select * from pricewatch where bqjg <= 5.00 and bqjg >= 8.00;

-- 对查询结果排序
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg;
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg desc;

-- null值比较: is null和is not null
select * from pricewatch where jclb is not null;
select * from pricewatch where jclb is not null limit 5;

-- distinct:去重
-- 查看有哪些检测类别
select distinct jclb from pricewatch;

-- in 和 not in运算符
-- 找出检测名称为“生菜”、“芹菜”和“菠菜”的商品
select * from pricewatch where jcmc in ('生菜','芹菜','菠菜');
-- 找出检测名称不是“生菜”、“芹菜”和“菠菜”的商品
select * from pricewatch where jcmc not in ('生菜','芹菜','菠菜');

-- between...and...和not between...and...
-- 找出本期价格在2块~5块之间的商品
select * from pricewatch where bqjg between 2 and 5;
-- 找出本期价格低于2块或高于5块的商品
select * from pricewatch where bqjg not between 2 and 5;

-- 找出价格环比上涨的商品
select * from pricewatch where bqjg>sqjg;
select * from pricewatch where hb>0; 

三、执行模糊查询

下面是对pricewatch执行模糊查询的示例。

通过使用like、not like和rlike关键字实现模糊匹配查询。它们之间的区别如下:

  • 1) like和not like支持两个简单匹配符号 _和%,其中"_"表示任意单个字符,字符"%"表示任意数量的任意字符。
  • 2) rlike支持标准的Java正则表达式符号。
-- 找出所有检测名称以'米'字结尾的商品
select * from pricewatch where jcmc like '%米';
select * from pricewatch where jcmc rlike '米$';

-- 找出所有检测名称以'泰'字开头的商品
select * from pricewatch where jcmc like '泰%';
select * from pricewatch where jcmc rlike '^泰';

-- 找出所有检测名称以'花'字开头、以'油'字结尾的商品
select * from pricewatch where jcmc like '花%油';
select * from pricewatch where jcmc rlike '^花.*油$';

select * from pricewatch where jcmc not like '花%油';

-- 找出所有检测名称以'米'或'油'字结尾的商品
select * from pricewatch where jcmc rlike '[米油]$';

-- 找出所有检测名称为"xxx菜"的商品
select * from pricewatch where jcmc rlike '^.{1}菜$';

-- 找出价格高于8元的大米
select * from pricewatch where bqjg>8 and jcmc like '%米';
select * from pricewatch where bqjg<3.5 and jcmc like '%米'; 

四、使用聚合函数和分组聚合查询

下面的示例在查询中使用了聚合函数。

-- 找出最高价格(最低价格、平均价格)
select max(bqjg) as max_price from pricewatch;
select avg(bqjg) as avg_price from pricewatch;
select min(bqjg) as min_price from pricewatch;

-- 统计共有多少个检测商品
select count(*) from pricewatch;
select count(*) as total from pricewatch where jclb is not null;

-- 统计每个检测类别有多少个检测商品
select jclb,count(*) from pricewatch group by jclb;
select jclb,count(*) as total from pricewatch group by jclb order by total desc;

-- 统计每种被检测商品类别的均价
select jclb,avg(bqjg) as avg_price from pricewatch group by jclb;
select distinct jclb as jc from pricewatch limit 5;
select jclb as `检测类别` from pricewatch limit 1;

-- 统计均价高于10元的商品类别
select jclb, avg(bqjg) from pricewatch group by jclb having avg(bqjg)>10;
select jclb, round(avg(bqjg),2) from pricewatch group by jclb having avg(bqjg)>10; 

五、case when语句

根据原始字段的值,输出转换后的新值。常用在机器学习数据打标签上。

create table if not exists price_label as 
select jcmc,
        case
             when bqjg>25 then 2
             when bqjg<10 then 0 
             else 1
        end as label
from pricewatch;

select * from price_label limit 10; 

六、子查询

Hive QL也支持子查询。

select p.* 
from pricewatch p 
where p.bqjg in (select max(d.bqjg) from pricewatch d); 

《PySpark原理深入与编程实战》