Finding information on Hive tables from HDFS
I was curious about our Hive tables' total usage on HDFS and what the average file size was under the current partitioning scheme, so I wrote this Ruby script to calculate both.
current = ''
file_count = 0
total_size = 0
output = File.open('output.csv', 'w')

IO.popen('hadoop fs -lsr /user/hive/warehouse').each_line do |line|
  split = line.split(/\s+/)
  # permissions, replication, user, group, size, mod_date, mod_time, path
  next unless split.size == 8
  path = split[7]
  size = split[4]
  permissions = split[0]
  tablename = path.split('/')[4]

  # When we hit a new table, emit the stats accumulated for the previous one.
  if tablename != current
    average_size = file_count == 0 ? 0 : total_size / file_count
    result = "#{current},#{file_count},#{total_size},#{average_size}"
    unless current == ''
      puts result
      output.puts result
    end
    total_size = 0
    current = tablename
    file_count = 0
  end

  file_count += 1 unless permissions[0] == 'd' # don't count directories as files
  total_size += size.to_i
end

# Flush the stats for the final table, which the loop above never reaches.
unless current == ''
  average_size = file_count == 0 ? 0 : total_size / file_count
  output.puts "#{current},#{file_count},#{total_size},#{average_size}"
end
output.close
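To illustrate the parsing step, here is a minimal sketch of how one listing line is split into fields. The sample line is hypothetical (made up for illustration), but the field positions match what the script above relies on from `hadoop fs -lsr` output:

```ruby
# A hypothetical line of `hadoop fs -lsr` output (example data, not real):
line = "-rw-r--r--   3 hive supergroup   1048576 2013-01-15 10:30 /user/hive/warehouse/events/dt=2013-01-14/part-00000"

fields = line.split(/\s+/)
# fields: permissions, replication, user, group, size, mod_date, mod_time, path
path = fields[7]
size = fields[4].to_i
# /user/hive/warehouse/<table>/... -- index 4 because the leading '/'
# produces an empty first element when splitting on '/'
table = path.split('/')[4]

puts table # => "events"
puts size  # => 1048576
```

Note that any file nested more than one level under the warehouse root still attributes to the table at index 4, which is what lets the script aggregate across partition subdirectories.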
Lots of our files turned out to be small, so I am going to experiment with different partitioning and compression schemes.