# Welcome to pyobjectify

[![PyPI](https://img.shields.io/pypi/v/pyobjectify?style=flat-square&color=222222)](https://pypi.org/project/pyobjectify)

Bridge the gap across the different file formats and streamline the process of accessing ingested data via Python objects

![license](https://img.shields.io/badge/license-MIT-green?style=flat-square&color=022169) ![issues](https://img.shields.io/github/issues/wu-rymd/pyobjectify?style=flat-square&color=841C1C) [![codecov](https://codecov.io/gh/wu-rymd/pyobjectify/branch/main/graph/badge.svg?token=410L0PN9UC)](https://codecov.io/gh/wu-rymd/pyobjectify) ![build](https://img.shields.io/github/actions/workflow/status/wu-rymd/pyobjectify/build.yml?style=flat-square)

## Overview

Open data abounds. For example, NYC Open Data has over 3,000 datasets spanning over 97 agencies in New York City. This data comes in many different formats, including CSV, JSON, XML, XLS/XLSX, KML, KMZ, Shapefile, GeoJSON, and more.

Importing and analyzing the data in Python involves sending a request to download the raw data, then converting it into a Python object so that methods can be used to parse its contents. However, this process varies across the many different data types.

This project aims to streamline this process and bridge the gap across the different file formats, so that the end user can get started on data analytics with a single function call.

## Install from pip

```
pip install pyobjectify
```

## Quick start

```python
import pyobjectify
import pandas as pd

json_dict = pyobjectify.from_url("https://bit.ly/42KCUSv")  # URL holds JSON data, returns data in dict
json_df = pyobjectify.from_url("https://bit.ly/42KCUSv", pd.DataFrame)  # User-specified output data type
```

# [Autogenerated documentation (link)](source/pyobjectify.html)

## Supported types

#### Connectivity types

- Local files (_e.g._ `./relative/example.json`, `/absolute/path/example.json`)
- Online, static (_e.g._ `https://some.website/example.json`, `http://bit.ly/some-json-endpoint`)

A data stream from the Internet, for example, is not supported at the moment.

#### Resource (input) data types

- JSON
- CSV
- TSV
- XML
- XLSX

#### Supported conversions

- JSON → `dict`, `list`, `pandas.DataFrame`
- CSV → `list`
- TSV → `list`
- XML → `dict`
- XLSX → `dict`

---

# Examples

The main method that the end user would typically use is `from_url()`. It takes two parameters: a URL to a resource and, optionally, a user-specified data type for the output.

You can use this method as in the Quick start example above:

```python
import pyobjectify
import pandas as pd

json_dict = pyobjectify.from_url("https://bit.ly/42KCUSv")  # URL holds JSON data, returns data in dict
json_df = pyobjectify.from_url("https://bit.ly/42KCUSv", pd.DataFrame)  # User-specified output data type
```

Note that if the resource cannot be converted to the user-specified output data type, or that conversion is not implemented, a `TypeError` will be raised. The supported resource (input) data types and supported conversions are listed above.

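Because an unsupported conversion surfaces as a `TypeError`, a caller can try several output types in order and keep the first one that works. The following is a minimal sketch, not part of pyobjectify: `first_supported` and `fake_loader` are hypothetical names, and `fake_loader` is a stand-in that mimics a CSV resource (which converts only to `list`) so the example runs without network access.

```python
def first_supported(loader, url, out_types):
    """Try each output type in order; return (out_type, data) for the first success.

    `loader` is any callable taking (url, out_type) and raising TypeError
    when the conversion is unsupported.
    """
    for out_type in out_types:
        try:
            return out_type, loader(url, out_type)
        except TypeError:
            continue  # conversion unsupported, try the next candidate
    raise TypeError(f"none of the requested output types is supported for {url}")


def fake_loader(url, out_type=None):
    """Hypothetical stand-in for a loader: behaves like a CSV resource (list only)."""
    if out_type not in (None, list):
        raise TypeError(f"cannot convert to {out_type}")
    return [["a", "b"], ["1", "2"]]


# dict is rejected with TypeError, so the list conversion is used instead.
chosen_type, data = first_supported(fake_loader, "example.csv", [dict, list])
```

With the library installed, passing `pyobjectify.from_url` as the loader should behave the same way, assuming it accepts the URL and the output type positionally as in the Quick start.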
### Subroutines

In addition to the main `from_url()` method, which provides the whole library's functionality in a single call, there are subroutines that are exposed publicly so the user can tweak the more granular operations:

- `url_to_connectivity(url)`
- `retrieve_resource(url, connectivity)`
- `get_resource_types(resource)`
- `get_conversions(in_types, out_type)`
- `convert(resource, conversions)`

In fact, the `from_url()` method runs all of these subroutines, in that order.

#### `url_to_connectivity(url)`

This function returns the connectivity type of the resource, given its URL.

**Example:**

```python
connectivity = url_to_connectivity("https://bit.ly/42KCUSv")
print(connectivity)
# prints the ONLINE_STATIC member of the Connectivity enumeration
```

`Connectivity` is an enumeration of the supported file connectivity types: `ONLINE_STATIC` and `LOCAL`. (At the moment, a data stream from the Internet is not supported.) A `TypeError` will be raised if the connectivity type is not supported.

#### `retrieve_resource(url, connectivity)`

This function retrieves the resource at the URL, which has the specified connectivity type. A `TypeError` will be raised if the resource connectivity type is not supported.

**Example:**

```python
url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  # the ONLINE_STATIC member of Connectivity
resource = retrieve_resource(url, connectivity)
print(resource)
"""
<__main__.Resource object at 0x104be6fd0>
"""
```

The `Resource` class stores some metadata about the resource: its URL, the connectivity type, the HTTP response, and the response in plaintext.

#### `get_resource_types(resource)`

This function returns a list of the possible input types of the resource. Heuristics are used to determine the possible data types.

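To illustrate what such heuristics might look like, here is a minimal sketch that guesses formats from plain text by attempting to parse it. It is not pyobjectify's actual implementation: `guess_types` and `_looks_delimited` are hypothetical names, the format labels are plain strings rather than `InputType` members, and only the standard library is used.

```python
import json
import xml.etree.ElementTree as ET
from typing import List


def _looks_delimited(text: str, delimiter: str) -> bool:
    """True if every non-empty line has the same, non-zero number of delimiters."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    counts = {line.count(delimiter) for line in lines}
    return len(counts) == 1 and counts.pop() > 0


def guess_types(text: str) -> List[str]:
    """Return every format the text plausibly matches, in the order checked."""
    guesses = []
    try:
        json.loads(text)
        guesses.append("JSON")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        ET.fromstring(text)
        guesses.append("XML")
    except ET.ParseError:
        pass
    if _looks_delimited(text, ","):
        guesses.append("CSV")
    if _looks_delimited(text, "\t"):
        guesses.append("TSV")
    return guesses
```

A text can match more than one format (a one-line JSON object with a comma also passes the naive CSV check), which is one reason a function like this would return a list of candidates rather than a single answer.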
**Example:**

```python
url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  # the ONLINE_STATIC member of Connectivity
resource = retrieve_resource(url, connectivity)  # <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)
print(in_types)
# prints a list whose only element is the JSON member of InputType
```

`InputType` is an enumeration of the supported input data types. If the input type cannot be determined, a `TypeError` will be raised.

#### `get_conversions(in_types, out_type=None)`

This function returns a list of the possible conversions to output data types, given the list of the probable input data types of the resource. If there are no possible conversions, a `TypeError` is raised.

**Example:**

```python
url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  # the ONLINE_STATIC member of Connectivity
resource = retrieve_resource(url, connectivity)  # <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)  # list containing only the JSON input type
conversions = get_conversions(in_types)
print(conversions)
# prints a list of three (in, out) tuples: JSON to dict, JSON to list, JSON to pandas DataFrame
```

This function returns a list of (in, out) conversion tuples. Since the only probable input data type was determined to be JSON, the three supported conversions are to Python `dict` or `list`, or pandas `DataFrame`.

#### `convert(resource, conversions)`

This function attempts to convert the resource data using each entry in the list of possible conversions. The first successful conversion from a probable resource type to an output data type is returned. If all conversions are unsuccessful, a `TypeError` is raised.

**Example:**

```python
url = "https://bit.ly/42KCUSv"
connectivity = url_to_connectivity(url)  # the ONLINE_STATIC member of Connectivity
resource = retrieve_resource(url, connectivity)  # <__main__.Resource object at 0x104be6fd0>
in_types = get_resource_types(resource)  # list containing only the JSON input type
conversions = get_conversions(in_types)  # (in, out) tuples: JSON to dict, list, DataFrame
output = convert(resource, conversions)

print(output)
"""
{'data': [{'condition': 'Clear sky', ...
"""

print(type(output))
"""
<class 'dict'>
"""
```

Note that calling `from_url("https://bit.ly/42KCUSv")` runs the subroutines in the listed order. However, as shown, the end user can customize the inner workings by calling the exposed subroutines directly.