Finding information on Hive tables from HDFS
I was curious about our Hive tables' total usage on HDFS and what the average file size was under the current partitioning scheme, so I wrote this Ruby script to calculate both.
current = ''
file_count = 0
total_size = 0
output = File.open('output.csv', 'w')

IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
  split = line.split(/\s+/)
  # permissions, replication, user, group, size, mod_date, mod_time, path
  next unless split.size == 8
  path = split[7]
  size = split[4]
  permissions = split[0]
  tablename = path.split('/')[4]

  # When we hit a new table, emit the stats accumulated for the previous one.
  if tablename != current
    average_size = file_count == 0 ? 0 : total_size / file_count
    result = "#{current},#{file_count},#{total_size},#{average_size}"
    unless current == ''
      puts result
      output.puts result
    end
    total_size = 0
    current = tablename
    file_count = 0
  end

  file_count += 1 unless permissions[0] == 'd' # don't count directories as files
  total_size += size.to_i
end

# Flush the stats for the final table, which the loop above never reaches.
unless current == ''
  average_size = file_count == 0 ? 0 : total_size / file_count
  output.puts "#{current},#{file_count},#{total_size},#{average_size}"
end
output.close
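To illustrate the parsing step, here is a minimal sketch of how one listing line is split into fields. The sample line is hypothetical (made up for illustration), but the field positions match what the script above relies on from `hadoop fs -lsr` output:

```ruby
# A hypothetical line of `hadoop fs -lsr` output (example data, not real):
line = "-rw-r--r--   3 hive supergroup   1048576 2013-01-15 10:30 /user/hive/warehouse/events/dt=2013-01-14/part-00000"

fields = line.split(/\s+/)
# fields: permissions, replication, user, group, size, mod_date, mod_time, path
path = fields[7]
size = fields[4].to_i
# /user/hive/warehouse/<table>/... -- index 4 because the leading '/'
# produces an empty first element when splitting on '/'
table = path.split('/')[4]

puts table # => "events"
puts size  # => 1048576
```

Note that any file nested more than one level under the warehouse root still attributes to the table at index 4, which is what lets the script aggregate across partition subdirectories.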
Lots of our files turned out to be small, so I am going to experiment with different partitioning and compression schemes.