New features in the Canopy Data Import Tool Version 1.1.0:
Support for Pandas v0.20, Excel/CSV export capabilities, and more
We’re pleased to announce a significant new feature release of the Canopy Data Import Tool, version 1.1.0. The Data Import Tool allows users to quickly and easily import CSVs and other structured text files into Pandas DataFrames through a graphical interface, manipulate the data, and create reusable Python scripts to speed future data wrangling. Here are some of the notable updates in version 1.1.0:
1. Support for Python 3 and PyQt
The Data Import Tool now supports Python 3 and both PyQt and PySide backends.
2. Exporting DataFrames to csv/xlsx file formats
We understand that data exploration and manipulation are only one part of your data analysis process, which is why the Data Import Tool now provides a way for you to save the DataFrame as a CSV/XLSX file. This way, you can share processed data with your colleagues or feed this processed file to the next step in your data analysis pipeline.
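For reference, the equivalent export in plain Pandas looks roughly like the sketch below (the sample frame and file names are placeholders; the Tool performs the equivalent of this for you):

import pandas as pd

# Build a small example frame and export it in both formats.
df = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
df.to_csv('processed.csv', index=False)      # CSV export
df.to_excel('processed.xlsx', index=False)   # XLSX export (requires an Excel writer such as openpyxl)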
3. Column Sort Indicators
In earlier versions of the Data Import Tool, it was not obvious that clicking on the right end of a column header sorted that column. With this release, we added sort indicators to every column, which can be clicked to sort the column in ascending or descending order. And given the complex nature of real-world data, we know that sorting on a single column is rarely enough, so we also made sorting in the Data Import Tool stable (i.e., sorting preserves any existing order in the DataFrame).
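If you want the same stable behavior in plain Pandas, here is a minimal sketch (the column names are ours, not from the Tool): `kind='mergesort'` selects Pandas’ stable sorting algorithm, so successive single-column sorts compose into a multi-level sort.

import pandas as pd

df = pd.DataFrame({'room': ['B', 'A', 'B', 'A'],
                   'temp': [23.1, 22.8, 22.5, 23.4]})
# Stable sorts compose: sort the secondary key first, then the primary.
df = df.sort_values('temp', kind='mergesort')
df = df.sort_values('room', kind='mergesort')
# Rows are now ordered by room, with temp order preserved within each room.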
4. Support for Pandas versions 0.19.2 – 0.20.1
Version 1.1.0 of the Data Import Tool supports versions 0.19.2 through 0.20.1 of the Pandas library.
5. Column Name Selection
If duplicate column names exist in a data file, Pandas automatically mangles them to create unique column names. This mangling can be buggy at times, especially if there is whitespace around the column names. The Data Import Tool corrects this behavior to give a consistent user experience. Until the last release, this was done under the hood; with this release, the Tool explicitly points out which columns are being renamed and how.
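As a quick illustration of the default Pandas mangling the Tool now surfaces (the data here is made up):

import pandas as pd
from io import StringIO

raw = StringIO('x,x,y\n1,2,3\n')   # two columns share the name 'x'
df = pd.read_csv(raw)              # mangle_dupe_cols=True is the default
print(df.columns.tolist())         # ['x', 'x.1', 'y']: the duplicate was renamed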
6. Template Discovery
With this release, we updated how a Template file is chosen for a given input file. If multiple template files are discovered to be relevant, we choose the latest. We also sped up loading data from files if a relevant Template is used.
For those of you new to the Data Import Tool, a Template file contains all of the commands you executed on the raw data using the Data Import Tool. A Template file is created when a DataFrame is successfully imported into the IPython console in the Canopy Editor. Further, a unique Template file is created for every data file.
Using Template files, you can save your progress; when you later reload the data file into the Tool, it will automatically discover and load the right Template for you, letting you pick up right where you left off.
7. Increased Cell Copy and Data Loading Speeds
Copying cells has been sped up significantly. We also sped up loading data from large files (>70MB in size).
Using the Data Import Tool in Practice: a Machine Learning Use Case
In theory, we could look at the various machine learning models that can be used to solve our problems and jump right to training and testing the models.
However, in reality, a large amount of time is invested in data cleaning and preparation. More often than not, real-life data cannot be fed to a machine learning model directly: there may be missing values, and the data may need further processing to remove unnecessary details and join columns in order to produce a clean, concise dataset.
That’s where the Data Import Tool comes in. The Pandas library made data cleaning and processing much easier, and the Data Import Tool makes it easier still. By letting you visually clean your dataset, be it removing, converting, or joining columns, the Tool allows you to operate on the DataFrame and immediately see the outcome of each operation. Not only that, the Data Import Tool is stateful, meaning that every command can be reverted and changes can be undone.
To give you a real world example, let’s look at the training and test datasets from the Occupancy detection dataset. The dataset contains 8 columns of data, the first column contains index values, the second column contains DateTime values and the rest contain numerical values.
As soon as you try loading the dataset, you might get an error. This is because the first row contains column headers for only 7 columns, while the rest of the dataset contains 8 columns of data, including the index column. Because of this, we have to skip the first row of the file, which can be done from the Edit Command pane of the ReadData command.
After we set `Number of rows to skip` to `1` and click `Refresh Data`, we see the DataFrame we expect from the raw data. You might notice that the Data Import Tool automatically converted the second column of data into a `DateTime` column. The DIT infers the type of data in each column and automatically performs the necessary conversions. Similarly, the last column was converted into a Boolean column because it represents Occupancy, with values 0/1.
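In plain Pandas, those conversions amount to roughly the following (the catalyst helpers in the generated script shown later add validation on top; the sample values here are illustrative):

import pandas as pd

df = pd.DataFrame({'date': ['2015-02-04 17:51:00'], 'Occupancy': [1]})
df['date'] = pd.to_datetime(df['date'])          # string -> DateTime
df['Occupancy'] = df['Occupancy'].astype(bool)   # 0/1 -> False/True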
As we can see from the raw data, the first column contains index values. We can apply the `SetIndex` command from the right-click menu on the `ID` column.
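In plain Pandas, the equivalent is a single call (the `ID` label is our name for that first column, and the sample values are made up):

import pandas as pd

df = pd.DataFrame({'ID': [140, 141], 'Temperature': [23.18, 23.15]})
df = df.set_index('ID')   # use the ID column as the DataFrame index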
Alongside automatic conversions, the DIT generates the relevant Python/Pandas code, which can be saved from the `Save -> Save Code` sub menu item. The complete code generated when we loaded the training data set can be seen below:
# -*- coding: utf-8 -*-
import pandas as pd

# Pandas version check
from pkg_resources import parse_version
if parse_version(pd.__version__) != parse_version('0.19.2'):
    raise RuntimeError('Invalid pandas version')

from catalyst.pandas.convert import to_bool, to_datetime
from catalyst.pandas.headers import get_stripped_columns

# Read Data from datatest.txt
filename = 'occupancy_data/datatest.txt'
data_frame = pd.read_table(
    filename,
    delimiter=',', encoding='utf-8', skiprows=1,
    keep_default_na=False, na_values=['NA', 'N/A', 'nan', 'NaN', 'NULL', ''], comment=None,
    header=None, thousands=None, skipinitialspace=True,
    mangle_dupe_cols=True, quotechar='"',
    index_col=False
)

# Ensure stripping of columns
data_frame = get_stripped_columns(data_frame)

# Type conversion for the following columns: 1, 7
for column in ['7']:
    valid_bools = {0: False, 1: True, 'true': True, 'f': False, 't': True, 'false': False}
    data_frame[column] = to_bool(data_frame[column], valid_bools)
for column in ['1']:
    data_frame[column] = to_datetime(data_frame[column])
As you can see, the generated script shows how the training data can be loaded into a DataFrame using Pandas, how the relevant columns can be converted to Bool and DateTime types, and how a column can be set as the index of the DataFrame. We can trivially modify this script to perform the same operations on the other datasets by replacing the filename.
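As a minimal sketch of that reuse (with the catalyst helpers swapped for plain-Pandas equivalents, and the other occupancy filenames assumed to follow the same pattern):

import pandas as pd

def load_occupancy(filename):
    """Apply the same cleaning steps to any of the occupancy files."""
    df = pd.read_table(
        filename, delimiter=',', skiprows=1, header=None,
        index_col=False, skipinitialspace=True,
    )
    df[1] = pd.to_datetime(df[1])   # timestamp column
    df[7] = df[7].astype(bool)      # occupancy flag (0/1)
    return df

train = load_occupancy('occupancy_data/datatraining.txt')  # assumed filename
test = load_occupancy('occupancy_data/datatest.txt')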
Finally, not only does the Data Import Tool generate and autosave a Python/Pandas script for each of the commands applied, it also saves them into a nifty Template file. The Template file aids in reproducibility and speeds up the analysis process.
Once you successfully modify the training data, every subsequent time you load it with the Data Import Tool, the Tool will automatically apply the commands/operations you previously ran. Not only that: since the training and test datasets are similar, we need to perform the same data cleaning operations on both files.
Having cleaned the training dataset with the Data Import Tool, when we load the test dataset, the Tool intelligently recognizes that we are loading a file similar to the training dataset and automatically performs the same operations we performed on the training data.
The datasets are available at http://ift.tt/2smby7Z.
Ready to try the Canopy Data Import Tool?
Download Canopy (free) and click on the icon to start a free trial of the Data Import Tool today.
(NOTE: The free trial is currently only available for Python 2.7 users. Python 3 users may request a free trial by emailing canopy.support@enthought.com. All paid Canopy subscribers have access to the Data Import Tool for both Python 2 and Python 3.)
We encourage you to update to the latest version of the Data Import Tool in Canopy’s Package Manager (search for the “catalyst” package) to make the most of these updates.
For a complete list of changes, please refer to the Release Notes for Version 1.1.0 of the Tool. Refer to the Enthought Knowledge Base for known issues with the Tool.
Finally, if you would like to provide us feedback regarding the Data Import Tool, write to us at canopy.support@enthought.com.
Additional resources:
- Download Canopy and start a free 7-day trial of the Data Import Tool
- Canopy Data Import Tool Product Page
Related blogs:
- Using the Canopy Data Import Tool to Speed Cleaning and Transformation of Data & New Release Features
- Loading Data Into a Pandas DataFrame: The Hard Way, and The Easy Way
- Handling Missing Values in Pandas DataFrames: the Hard Way, and the Easy Way
- Enthought Announces Canopy 2.1: A Major Milestone Release for the Python Analysis Environment and Package Distribution (June 2017)
Watch a 2-minute demo video to see how the Canopy Data Import Tool works:
See the Webinar “Fast Forward Through Data Analysis Dirty Work” for examples of how the Canopy Data Import Tool accelerates data munging: