Read Files in an HDFS Directory Using PySpark

Sometimes I need to get a list of the files in an HDFS directory from PySpark. I found a block of code that works, so I'm recording it here for future reference.


# Grab the Hadoop classes through the JVM gateway of the active SparkSession
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
path = hadoop.fs.Path('/data/customer/raw/')

# listStatus() returns a FileStatus object for every entry in the directory
for f in fs.get(conf).listStatus(path):
    print(f.getPath())
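
One caveat: hadoop.conf.Configuration() builds a fresh configuration from whatever Hadoop config files happen to be on the classpath. To reuse the Hadoop settings the running Spark application was actually launched with, a sketch like the one below should also work; it uses listFiles() with recursive=True to walk into subdirectories as well (the path is just an example).

# Sketch: reuse Spark's own Hadoop configuration instead of building a
# fresh one, so the listing sees the same defaultFS and related settings
hadoop = spark._jvm.org.apache.hadoop
conf2 = spark.sparkContext._jsc.hadoopConfiguration()
fs2 = hadoop.fs.FileSystem.get(conf2)

# listFiles(path, True) walks the tree recursively and returns a
# RemoteIterator, which py4j lets us drive with hasNext()/next()
files = fs2.listFiles(hadoop.fs.Path('/data/customer/raw/'), True)
while files.hasNext():
    status = files.next()
    print(status.getPath().toString())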

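The FileStatus objects returned by listStatus() carry more than just the path. Reusing fs, conf, and path from the first snippet, a filter along these lines keeps only regular files and prints their sizes, since FileStatus exposes isFile(), getLen(), and getModificationTime():

# Skip subdirectories and print each file's name and size in bytes;
# f is a Hadoop FileStatus, so these getters come straight from its API
for f in fs.get(conf).listStatus(path):
    if f.isFile():
        print(f.getPath().getName(), f.getLen())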