Hive QL查询
使用HiveQL查询映射到Hive表的数据集。 Hive中的SELECT基础语法和标准SQL语法基本一致,支持WHERE、DISTINCT、GROUP BY、ORDER BY、HAVING、LIMIT、子查询等。
HiveQL查询语法如下:
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [ORDER BY col_list] [CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list] ] [LIMIT [offset,] rows]
select...from语句
select是SQL的投影算子,from子句标识了从哪个表、视图或嵌套查询中选择记录。
例:执行简单的查询:
select * from employees;
说明:
在Hive中查询语句往往都要被解析成MapReduce的job进行计算,但是有两个查询语句是不走MapReduce的:
- 1) select * from employees;
- 2) select * from employees limit 2;
注:在MYSQL中limit是取前几条记录,但是在Hive中,limit是抽样,会随机返回对应的记录数。
另外,“SELECT *”扫描整个表/文件,不会触发MapReduce jobs,因此它比“select 列名”要运行得快。
一、数据准备
1、启动HDFS和YARN集群:
$ start-dfs.sh $ start-yarn.sh
2、上传特价检测的数据文件pricewatch.csv:
$ hdfs dfs -mkdir -p /hive_data/p $ hdfs dfs -put pricewatch.csv /hive_data/p/
启动Hive CLI:
$ hive
创建一个Hive外部表pricewatch,并加载数据文件:
-- 创建数据库 create database xueai8; -- 查看当前有哪些数据库 show databases; -- 使用数据库 use xueai8; -- 创建外部表,并忽略第一行(标题行) create external table pricewatch( recordid string, jclb string, jcmc string, bqjg decimal(10,3), sqjg decimal(10,3), tb decimal(10,3), hb decimal(10,3) ) row format delimited fields terminated by ',' location '/hive_data/p' TBLPROPERTIES ('serialization.null.format' = '','skip.header.line.count' = '1'); show tables; -- 查看完整建表语句 show create table pricewatch;
二、执行简单查询
下面是对pricewatch执行简单条件查询的示例。
-- limit:限制返回记录的数量 -- 查询前10条数据: select * from pricewatch limit 10; select * from pricewatch limit 10,5; -- 条件查询 -- 单条件 -- 查询出所有本期价格高于5.00元的商品 select * from pricewatch where bqjg > 5.00; -- 多条件 select * from pricewatch where bqjg > 5.00 and bqjg < 8.00; select * from pricewatch where bqjg <= 5.00 and bqjg >= 8.00; -- 对查询结果排序 select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg; select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg desc; -- null值比较: is null和is not null select * from pricewatch where jclb is not null; select * from pricewatch where jclb is not null limit 5; -- distinct:去重 -- 查看有哪些检测类别 select distinct jclb from pricewatch; -- in 和 not in运算符 -- 找出检测名称为“生菜”、“芹菜”和“菠菜”的商品 select * from pricewatch where jcmc in ('生菜','芹菜','菠菜'); -- 找出检测名称不是“生菜”、“芹菜”和“菠菜”的商品 select * from pricewatch where jcmc not in ('生菜','芹菜','菠菜'); -- between...and...和not between...and... -- 找出本期价格在2块~5块之间的商品 select * from pricewatch where bqjg between 2 and 5; -- 找出本期价格低于2块或高于5块的商品 select * from pricewatch where bqjg not between 2 and 5; -- 找出价格环比上涨的商品 select * from pricewatch where bqjg>sqjg; select * from pricewatch where hb>0;
三、执行模糊查询
下面是对pricewatch执行模糊查询的示例。
通过使用like、not like和rlike关键字实现模糊匹配查询。它们之间的区别如下:
- 1) like和not like支持两个简单匹配符号 _和%,其中"_"表示任意单个字符,字符"%"表示任意数量的任意字符。
- 2) rlike支持标准的Java正则表达式符号。
-- 找出所有检测名称以'米'字结尾的商品 select * from pricewatch where jcmc like '%米'; select * from pricewatch where jcmc rlike '米$'; -- 找出所有检测名称以'泰'字开头的商品 select * from pricewatch where jcmc like '泰%'; select * from pricewatch where jcmc rlike '^泰'; -- 找出所有检测名称以'花'字开头、以'油'字结尾的商品 select * from pricewatch where jcmc like '花%油'; select * from pricewatch where jcmc rlike '^花.*油$'; select * from pricewatch where jcmc not like '花%油'; -- 找出所有检测名称以'米'或'油'字结尾的商品 select * from pricewatch where jcmc rlike '[米油]$'; -- 找出所有检测名称为"xxx菜"的商品 select * from pricewatch where jcmc rlike '^.{1}菜$'; -- 找出价格高于8元的大米 select * from pricewatch where bqjg>8 and jcmc like '%米'; select * from pricewatch where bqjg<3.5 and jcmc like '%米';
四、使用聚合函数和分组聚合查询
下面的示例在查询中使用了聚合函数。
-- 找出最高价格(最低价格、平均价格) select max(bqjg) as max_price from pricewatch; select avg(bqjg) as avg_price from pricewatch; select min(bqjg) as min_price from pricewatch; -- 统计共有多少个检测商品 select count(*) from pricewatch; select count(*) as total from pricewatch where jclb is not null; -- 统计每个检测类别有多少个检测商品 select jclb,count(*) from pricewatch group by jclb; select jclb,count(*) as total from pricewatch group by jclb order by total desc; -- 统计每种被检测商品类别的均价 select jclb,avg(bqjg) as avg_price from pricewatch group by jclb; select distinct jclb as jc from pricewatch limit 5; select jclb as `检测类别` from pricewatch limit 1; -- 统计均价高于10元的商品类别 select jclb, avg(bqjg) from pricewatch group by jclb having avg(bqjg)>10; select jclb, round(avg(bqjg),2) from pricewatch group by jclb having avg(bqjg)>10;
五、case when语句
根据原始字段的值,输出转换后的新值。常用在机器学习数据打标签上。
create table if not exists price_label as select jcmc, case when bqjg>25 then 2 when bqjg<10 then 0 else 1 end as label from pricewatch; select * from price_label limit 10;
六、子查询
Hive QL也支持子查询。
select p.* from pricewatch p where p.bqjg in (select max(d.bqjg) from pricewatch d);