You can choose from these three approaches:
- You can write a shell script to do this task
- You can write a MapReduce job with a custom Partitioner class
- You can create a Hive partitioned table and partition it by year, month, and day; however, the directory names will then carry the partition column name (partition_column_name=value) as a prefix: /data/year=2016/month=01/date=07

Let me know which approach you prefer, and I will update the answer with an example based on it.
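For the third option, here is a minimal HiveQL sketch. The table and column names (events, staging_events, id, name, ts) are placeholders, not taken from the original question; note that STORED AS TEXTFILE keeps the partition files in plain text instead of the table's default file format:

```sql
-- Hypothetical table layout; adjust column names/types to your data.
CREATE TABLE events (
  id INT,
  name STRING,
  ts STRING
)
PARTITIONED BY (year STRING, month STRING, date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE;   -- plain-text files instead of ORC

-- Dynamic partitioning derives year/month/date per row on insert.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE events PARTITION (year, month, date)
SELECT id, name, ts,
       substr(ts, 1, 4), substr(ts, 6, 2), substr(ts, 9, 2)
FROM staging_events;
```

With this layout, Hive writes each row into a directory such as /data/year=2016/month=01/date=07, matching the naming caveat above.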
Update with the shell-script solution:
Given two input/source files with the same content in HDFS:
[[email protected] ~]$ hadoop fs -ls /user/cloudera/test_dir
Found 2 items
-rw-r--r-- 1 cloudera cloudera 79 2016-08-02 04:43 /user/cloudera/test_dir/test.file1
-rw-r--r-- 1 cloudera cloudera 79 2016-08-02 04:43 /user/cloudera/test_dir/test.file2
Shell-Skript:
#!/bin/bash
# Assuming src files are in hdfs, for local src file
# processing change the path and command accordingly
# if you do NOT want to write header in each target file
# then you can comment the writing header part from below script
src_file_path='/user/cloudera/test_dir'
trg_file_path='/user/cloudera/trgt_dir'
src_files=`hadoop fs -ls ${src_file_path}|awk -F " " '{print $NF}'|grep -v items`
for src_file in $src_files
do
echo processing ${src_file} file...
while IFS= read -r line
do
#ignore header from processing - that contains *id*
if [[ $line != *"id"* ]];then
DATE=`echo $line|awk -F " " '{print $NF}'`
YEAR=`echo $DATE|awk -F "-" '{print $1}'`
MONTH=`echo $DATE|awk -F "-" '{print $2}'`
DAY=`echo $DATE|awk -F "-" '{print $3}'`
file_name="file_${DATE}"
hadoop fs -test -d ${trg_file_path}/$YEAR/$MONTH/$DAY
if [ $? != 0 ];then
echo "dir not exist creating... ${trg_file_path}/$YEAR/$MONTH/$DAY "
hadoop fs -mkdir -p ${trg_file_path}/$YEAR/$MONTH/$DAY
fi
hadoop fs -test -f ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
if [ $? != 0 ];then
echo "file not exist: creating header... ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
echo "id name timestamp" |hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
fi
echo "writing line: '$line' to file: ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name"
echo "$line" |hadoop fs -appendToFile - ${trg_file_path}/$YEAR/$MONTH/$DAY/$file_name
fi
done < <(hadoop fs -cat $src_file)
done
manageFiles.sh
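The date-splitting logic inside the loop can be tried locally without an HDFS cluster. A minimal sketch using one sample record from the files above (the line format is taken from the run log below):

```shell
#!/bin/bash
# Extract the date (the last whitespace-separated field) from a sample
# record and split it into year/month/day, as the script above does.
line="1 Lorem 2013-01-01"
DATE=$(echo "$line" | awk -F " " '{print $NF}')
YEAR=$(echo "$DATE" | awk -F "-" '{print $1}')
MONTH=$(echo "$DATE" | awk -F "-" '{print $2}')
DAY=$(echo "$DATE" | awk -F "-" '{print $3}')
# Target path fragment built the same way as in the main script:
echo "${YEAR}/${MONTH}/${DAY}/file_${DATE}"   # prints 2013/01/01/file_2013-01-01
```

This makes it easy to verify the directory layout before pointing the script at real HDFS data.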
The script ran as:
[[email protected] ~]$ ./manageFiles.sh
processing /user/cloudera/test_dir/test.file1 file...
dir not exist creating... /user/cloudera/trgt_dir/2013/01/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
writing line: '1 Lorem 2013-01-01' to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
dir not exist creating... /user/cloudera/trgt_dir/2013/02/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
writing line: '2 Ipsum 2013-02-01' to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
dir not exist creating... /user/cloudera/trgt_dir/2013/03/01
file not exist: creating header... /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
writing line: '3 Ipsum 2013-03-01' to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
processing /user/cloudera/test_dir/test.file2 file...
writing line: '1 Lorem 2013-01-01' to file: /user/cloudera/trgt_dir/2013/01/01/file_2013-01-01
writing line: '2 Ipsum 2013-02-01' to file: /user/cloudera/trgt_dir/2013/02/01/file_2013-02-01
writing line: '3 Ipsum 2013-03-01' to file: /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
[[email protected] ~]$ hadoop fs -cat /user/cloudera/trgt_dir/2013/03/01/file_2013-03-01
id name timestamp
3 Ipsum 2013-03-01
3 Ipsum 2013-03-01
[[email protected] ~]$
Thanks. The Hive approach looks good; the only problem with a Hive partitioned table is that it will store the files in ORC file format, and I would like to use text format. Can you also tell me what shell script would do this task for me? –
see updated answer –