This tutorial demonstrates how to read CSV data from a remote file in TensorFlow, given the file's URL.
Abstract
TensorFlow is a library for large-scale machine learning applications, and the data for such applications often lives in remote files that are addressed by URL.
TensorFlow comes with built-in utilities for reading remote file data, which the rest of this tutorial covers.
Pre-requisites
Here are the prerequisites for following this tutorial comfortably:
- A Python installation, or an IDE such as PyCharm, to write and execute programs
- Basics of Python Programming
- Basics of TensorFlow. If you are not familiar with TensorFlow, it is strongly recommended to go through Getting Started with TensorFlow.
Remote File Data Definition
We will read the appliances energy consumption data from the UCI Machine Learning Repository.
Here is the URL of this dataset in the UCI Machine Learning Repository:
http://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv
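Because the data sits behind a URL, it first has to be fetched to a local file before a line-based reader can open it. The snippet below is a minimal standard-library sketch of that "download once, then reuse the local copy" idea. The helper name `fetch_once` and its `cache_dir` parameter are hypothetical and introduced only for illustration; this is not how any TensorFlow utility is implemented.

```python
import os
import urllib.request


def fetch_once(url, cache_dir):
    """Download url into cache_dir unless a copy is already there.

    Hypothetical helper for illustration; returns the local file path.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Use the last URL segment as the local file name.
    local_path = os.path.join(cache_dir, url.split("/")[-1])
    if not os.path.exists(local_path):
        # Only hit the network (or disk, for file:// URLs) on the first call.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

Calling `fetch_once` a second time with the same URL returns the same path without downloading again, which is convenient when a training script is re-run many times against the same dataset.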
Reading Remote File Data using TensorFlow
Here is the program that reads data from the remote URL mentioned in the previous section. It uses the TensorFlow 1.x API: it defines the CSV column types, downloads the file, creates a text line dataset, decodes each text line as CSV, and finally iterates over the dataset using a one-shot iterator in a TensorFlow session.
import collections
# Import TensorFlow library
import tensorflow as tf
# Define CSV column data types.
# Since CSV readers are sensitive to ordering of columns, OrderedDict is used
column_types = collections.OrderedDict([
    ("date", [""]),
    ("Appliances", [0]),
    ("lights", [0]),
    ("T1", [0.0]),
    ("RH_1", [0.0]),
    ("T2", [0.0]),
    ("RH_2", [0.0]),
    ("T3", [0.0]),
    ("RH_3", [0.0]),
    ("T4", [0.0]),
    ("RH_4", [0.0]),
    ("T5", [0.0]),
    ("RH_5", [0.0]),
    ("T6", [0.0]),
    ("RH_6", [0.0]),
    ("T7", [0.0]),
    ("RH_7", [0.0]),
    ("T8", [0.0]),
    ("RH_8", [0.0]),
    ("T9", [0.0]),
    ("RH_9", [0.0]),
    ("T_out", [0.0]),
    ("Press_mm_hg", [0.0]),
    ("RH_out", [0.0]),
    ("Windspeed", [0.0]),
    ("Visibility", [0.0]),
    ("Tdewpoint", [0.0]),
    ("rv1", [0.0]),
    ("rv2", [0.0])
])
def map_line_to_dict(line):
    """
    Parses one CSV text line and maps each value to its column name.
    :param line: text line to parse as CSV
    :return: dictionary of CSV column names and their parsed values
    """
    # list(...) is needed on Python 3, where values() returns a view, not a list
    mapped_values = tf.decode_csv(line, list(column_types.values()))
    return dict(zip(column_types.keys(), mapped_values))
# URL of the dataset
URL = "http://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv"
# Download the file to the local Keras cache and get its local path
# (newer releases expose the same function as tf.keras.utils.get_file)
path = tf.contrib.keras.utils.get_file(URL.split("/")[-1], URL)
# The following statement does not read any data yet; it only defines the pipeline
dataset = (tf.data
           .TextLineDataset(path)  # Create a text line Dataset from the downloaded file
           .skip(1)                # Skip the header line
           .map(map_line_to_dict)  # Convert each line to a dictionary
           .cache()                # Cache parsed records so later passes skip re-parsing
           )
# Make a one-shot iterator that passes over the dataset once
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
# Create a TensorFlow session to evaluate next_element
session = tf.Session()
try:
    while True:  # Loop until the iterator is exhausted
        data_dict = session.run(next_element)
        print(data_dict)
except tf.errors.OutOfRangeError:  # Raised once the end of the dataset is reached
    pass
finally:
    session.close()  # Release the resources held by the session
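To make the per-line decoding step more concrete, here is a rough pure-Python sketch of what happens to one text line: the line is split as CSV, each field is converted to the type of its column default, and the values are paired with the column names. It is only an approximation for illustration, not TensorFlow's implementation. The names `sample_types` and `parse_line` are hypothetical, the column spec is trimmed to three columns, and the sample line is made up rather than taken from the dataset.

```python
import collections
import csv

# Trimmed, hypothetical column spec in the same style as column_types:
# each default fixes the column type and the fallback for empty fields.
sample_types = collections.OrderedDict([
    ("date", [""]),
    ("Appliances", [0]),
    ("T1", [0.0]),
])


def parse_line(line, types=sample_types):
    """Parse one CSV line into a dict keyed by column name."""
    fields = next(csv.reader([line]))
    if len(fields) != len(types):
        raise ValueError(
            "expected %d fields, got %d" % (len(types), len(fields)))
    row = {}
    for (name, default), raw in zip(types.items(), fields):
        caster = type(default[0])  # str, int or float
        # Empty fields fall back to the column default.
        row[name] = caster(raw) if raw != "" else default[0]
    return row


# Made-up sample line, not taken from the dataset.
print(parse_line('"2016-01-01 00:00:00","60","19.5"'))
```

The sketch also shows why the column order matters: values are matched to names purely by position, so the spec must list the columns in the same order as the file.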
Thank you for reading through the tutorial. If you have any feedback, questions, or concerns, please share them in the comments and we will get back to you as soon as possible.