Using Data to Make lives better

DATA WRANGLING AND DATA ANALYSIS - A Hands on Approach

By Kashyap K N - September 11, 2018

Data wrangling in simple is conversion of a raw data to a semi structure data that can be used for deriving data patterns. It is one of the most crucial and starting step in analyzing a crude data.

Data scientists and Analysts spend about 80% of their time cleaning up the data and this is itself tells the significance of Data wrangling.

In this post we are going to look at a dataset from Kaggle, perform Data wrangling and do analysis of the cleansed data.

I have picked a dataset of fruits and vegetables prices from 2010-2018 in Kaggle. The goal is to predict the pattern of these fruits and vegetable prices across these 9 years time frame, so that it will be useful to find best time to cultivate a crop.

Before you start applying all the bleeding edge machine learning algorithms, it is essential to understand what information the data is conveying.

Let's first import the data and start working. I am using a jupyter notebook in this post to do data wrangling. If u do not have a jupyter notebook installed in your machine i would suggest you to use Google Colabs which is similar to jupyter notebook and you can just import all the libraries that you wish to use as shown in below screen shot.

Next, lets look at the data and its size.

Here we are looking at the first 5 rows of the dataset to see what it represents . The dataset consists of 4 columns and about 8.5 lakh rows. The column "Item Name" represents the name of the vegetable or the fruit and it has it's price for a particular date in "price" and "Date" column. The column "datesk" is date without any delimiter like '-' or '/'. It is important to have the date value in normalized form as shown above for applying machine learning algorithms.

Below screen shot shows the last 5 rows of the table, we can see that the values in the "Date" column are not uniform. If you observe above screen shot, you have date and time stamp but here there is only date.

For our analysis, time of the day is not a very significant attribute. So, we can ignore the time part and convert all the date values into a standard format using pandas to_datetime( ) as shown below.

Next thing we notice is NaN which represents Not a Number, which here means the values are missing. Let's find out the number of missing values in each column.

As you can see there are 7839 missing values in "Item Name" column. We cannot determine the price or do analysis without knowing the name of the vegetable or fruit, hence we need to remove all the rows with missing values in "Item Name" column.

Next lets look at "price" column. It has 317017 missing values. We cannot remove these because these values constitute about 36% of our entire dataset. It is a large chunk of data which could be useful. We shall see about this later down in this post.

Now let's see how many different types of fruits and vegetables are there in this dataset.

From the above figure we can see that the dataset has about 330 different fruits and vegetables.

We can use groupby( ) function and aggregation( ) function as shown below and find many things like number of entries of each fruit or vegetable, the maximum, minimum price, the average price etc.,

For Ex: 'Amla' has 2048 entries , it's minimum price is 30/kg and maximum price is 600/kg. If you observe there is high difference between minimum and maximum price.

One of insights we can get from this analysis is to see which fruit or vegetable has high demand during the year . This can be found by calculating the average price of a fruit or vegetable on a monthly basis . When in a particular month the price is high, it could mean high demand and low supply(there could also be many other reasons like the fruit or vegetable could not be grown on that particular month etc., Ex: Mango's can be grown only on summer season. But in this analysis our concentration is price centric.

First we need to get the month from the Date column. This can be done without creating a new column but it is much easier to do it this way for better understanding.

Now let us take a fruit and find which is the best month in the year to cultivate it. Let us consider 'Amla' and see which month has the highest demand based on the price.

As you can see in the month of May the price is high compared other months in the year .We can also visualize this by plotting it using a bar graph.

Below graph shows how the price of Amla has varied from 2010 to 2018.

Like wise, we can find the best month to cultivate a fruit or vegetable which could solve the common problems like 'high supply low demand' or 'low supply high demand' .

We can also predict the prices for the coming years using the previous year's data so that it can help farmers who need information about which crop to grow based on the demand in the coming months.

This is my first blog, so please add your comments / suggestions. This would motivate me to further research and add more information.

Thanks for reading. Happy Learning!!!

0 Comments

Report Abuse

DATA WRANGLING AND DATA ANALYSIS - A Hands on Approach

About