Kaggle data analysis

Start with small datasets so that it doesn’t take much time to import, analyze and visualize the data. Also try to choose datasets from a domain that you find interesting, because a liking for or better understanding of the dataset’s domain helps in further data analysis. Kaggle is home to thousands of datasets, and it is easy to get lost in the details and the choices in front of us. “Majority of Data Scientists I have met do not have formal data science education.” For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Kaggle Grandmaster Sergey Yurgenson, the Director of Advanced Data Science Services at DataRobot and a former world no. 1 on Kaggle leaderboards. The repository contains Python code (Facebook data.ipynb) and a summary of findings with supporting graphs in a presentation PDF (Facebook Data Analysis.pdf). This indicates that this is unprocessed data that I will clean, filter, and modify to prepare a data frame that's ready for analysis. Chasing is less complicated, as there is a fixed target to achieve. But I only wanted the seasons to be an index. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time. Leaving out 2015, things have been overwhelmingly in favour of teams fielding first. I downloaded the dataset from Kaggle. The approach discussed in this article is not the only way of getting started with Kaggle, but it is something that I have seen work in my mentoring experience. To plot these two series together, I combined them using Pandas' concat() method. For the first six seasons (2008-2013), teams were figuring out whether batting first or chasing would be better after winning the toss. Here, the darker color indicates more matches won. I have done this analysis from a historical point of view, giving an overview of what has happened in the IPL over the years.
The data analysis notebooks use a lot of libraries, so having sufficient background in them will be very helpful. Have a basic understanding of the different kinds of algorithms and, broadly, of the different use cases that can be solved using them. There are so many people with similar interests that you could find a good teammate for your next competition as well. These competitions generally have a monetary prize attached to them, and there are recruitment competitions too, so you could potentially find your next employer. Kaggle also has a job portal, so it is easy to apply for jobs as well. The courses offered on Kaggle are generally short and useful for brushing up your skills and knowledge. Kaggle is quite famous among the data science community, and hence your achievements here will be well received and recognized in the industry. Don’t try to participate in too many competitions at one time. However, Kochi was removed in the very next season, while the Pune Warriors were removed in 2013, bringing the number down to 8 from 2014 onwards. If you want to remove multiple columns, the column names are to be given in a list. figure() takes a parameter, figsize, which I set to (12,6). If we print the index of the series using the index property, we see it is of the form (2008, 'bat'), (2008, 'field') and so on. Seaborn provides some more advanced visualization features with less syntax and more customizations. Eight city-based franchises compete with each other over 6 weeks to find the winner. Now you are ready to jump into a live competition. Choose something that interests you, because these competitions are like a marathon: they go on for weeks, and it takes continuous effort and hard work to stay at the top of the leaderboard, so choosing something you like will help keep you motivated. Both the credit card fraud and heart failure datasets are something that we can relate to easily.
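The column-dropping step mentioned above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up data frame, not the article's actual notebook; the column names `umpire3` and `venue` are assumptions chosen only to show the single-name and list forms of `drop()`.

```python
import pandas as pd

# Tiny made-up stand-in for the matches data; the values are invented
# and the column names are assumptions for illustration only.
matches_raw_df = pd.DataFrame({
    "id": [1, 2, 3],
    "season": [2008, 2008, 2009],
    "umpire3": [None, None, None],
    "venue": ["Ground A", "Ground B", "Ground C"],
})

# Drop a single column by name; axis=1 means "operate on columns".
matches_df = matches_raw_df.drop("umpire3", axis=1)

# Drop several columns at once by passing the names in a list.
slim_df = matches_raw_df.drop(["umpire3", "venue"], axis=1)

print(list(matches_df.columns))
print(list(slim_df.columns))
```

Note that `drop()` returns a new data frame here, so the raw data frame keeps all of its columns.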
Also, there are two teams with almost the same name: the Rising Pune Supergiants and Rising Pune Supergiant. For this week’s ML practitioners series, Analytics India Magazine got in touch with Kaggle GM Okoshi Takumi. The notebooks available for this dataset include a variety of algorithms and approaches to building the model, and exploring and trying them would help in better understanding how to build a predictive model. Let's ask some specific questions, and try to answer them using data frame operations and interesting visualizations. Plot a few of the variables. I passed the data frame matches_won_each_season, with annot as True to have the values shown as well. If you are very new to data science and looking forward to learning the basics, check this YouTube playlist of mine about learning data science in 100 days. Now, let's take a look at the data I analyzed and what I learned in the process. Also, the result column should have a value of normal since tied matches also have win margins as 0. This gives us the number of matches that each team has won. I started my own data science journey by combining my learning on both Analytics Vidhya and Kaggle, a combination that helped me augment my theoretical knowledge with practical hands-on coding. So first do a gap analysis on your skillset: understand your current level of competency and check what it would take for you to reach a level where you are comfortable with the basics. When you have these basic skills, it becomes easy to learn further topics, and you will be able to appreciate some of the techniques or methods used by experienced data scientists. I gave the rotation parameter of xticks() a value of 75 to make the labels easier to read. This course was conducted by Jovian.ml in partnership with freeCodeCamp.org. All these courses have been divided into topics along with exercise notebooks. I passed the two series names as a list and set the value of axis as 1.
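Combining two series side by side with concat() might look like the sketch below. The series names and numbers are made up for illustration; only the pattern (a list of series plus axis=1) follows the description above.

```python
import pandas as pd

# Two made-up series indexed by team name; the values are invented.
batting_first = pd.Series({"Team A": 5, "Team B": 3}, name="batting_first")
fielding_first = pd.Series({"Team A": 4, "Team B": 6}, name="fielding_first")

# Pass the series as a list; axis=1 places them side by side as columns,
# aligned on their shared index.
combined_wins_df = pd.concat([batting_first, fielding_first], axis=1)

print(combined_wins_df)
```

Each series name becomes a column label, which is convenient when the combined data frame is plotted afterwards.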
In this article, I'm going to analyze data from the IPL's past seasons to see which teams have won the most games, how teams behave when winning a toss, who has the greatest legacy, and so on. Having covered a dataset suitable for a regression problem, the next step is to learn about a classification problem, and a few good Kaggle datasets that can be used for this are below. In this article you will analyze and study the professional lives of the participants, time spent studying data science topics, which ML method they actually use at work the … Kaggle & Datascience resources: a few of my favorite datasets from the Kaggle website are listed here. The name Pandas is derived from "panel data" and is also read as Python Data Analysis library. Google App Rating: a dataset from Kaggle. You can find the code and dataset here: https://github.com/DivyaThakur24/GoogleAppRating-DataAnalysis I used various matplotlib.pyplot methods such as figure(), xticks() and title() to set the size of the plot, title of the plot, and so on. The line import scipy.stats is needed to compute statistics for categorical data. After that, we need to import the data into the notebook or IDE. Check out the project here. The fact that they are the only two teams in the top 5 that were also part of the first season shows their dominance. The Chennai Super Kings, despite playing two fewer seasons than the Mumbai Indians, had only 9 fewer victories. Also, the data required for an analysis would mostly be spread across multiple platforms, public sources and third-party websites, so it would take a huge effort to consolidate them. Again I grouped the rows by season and then counted the different values of the toss_decision column by using value_counts(). We can see their dominance especially in the 2019 season, where MI defeated CSK 4 out of 4 times they met, including the playoff and the final.
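The group-then-count step described above could be sketched like this. The rows are made up; only the column names season and toss_decision come from the text.

```python
import pandas as pd

# Made-up rows; only the column names follow the article.
matches_df = pd.DataFrame({
    "season": [2008, 2008, 2008, 2009, 2009],
    "toss_decision": ["bat", "field", "field", "bat", "bat"],
})

# Group rows by season, then count each toss decision within a season.
# The result is a series indexed by (season, toss_decision) pairs.
toss_decision_counts = matches_df.groupby("season")["toss_decision"].value_counts()

print(toss_decision_counts)
```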
It gathers in one place a huge number of public datasets, most of which have been sanitized and made ready for use in analysis. The Indian Premier League or IPL is a T20 cricket tournament organized annually by the Board of Control for Cricket In India (BCCI). Then I plotted matches_won_each_season using sns.heatmap(). The position of the point to be annotated is given as a tuple. However, this was just scratching the surface. The ascending parameter was set to False. I tried to find the number of matches played in each season in the IPL from its inception to 2019. But participating in a lot of competitions at one time will not help you. While participating in a competition, always keep an eye on the discussion forums, as data issues and other problems faced by fellow participants are discussed there, and suggestions for solving them are shared too. The Groceries dataset below is a good example and easy to relate to as well. This exploratory analysis is based on the “Google Play Store Apps” Kaggle dataset. They are the same team, and there was no change in ownership; the change has more to do with superstition. So I removed the column using the drop() method by passing the column name and axis value. Mumbai and Chennai, our legacy teams, have won the IPL at least 3 times. I have a YouTube channel where I teach and talk about various data science concepts. I used this data frame for further analysis. In many cases, the winning solution is shared with the participants through the discussion forum. In those cases, try to understand it and see if there are any learnings that you can apply in other competitions. Since a percentage gives a clearer picture, I divided the above result by matches_per_season and multiplied it by 100.
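Turning per-season counts into percentages might be done as in the sketch below. It uses made-up rows, and the explicit `div(..., level="season")` call is my own choice for making the alignment between the two indexes obvious; the original notebook may have written the division differently.

```python
import pandas as pd

# Made-up rows; ids are unique per match, as in the article's data.
matches_df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "season": [2008, 2008, 2008, 2008, 2009, 2009],
    "toss_decision": ["bat", "field", "field", "field", "bat", "field"],
})

# Matches per season: count the unique ids within each season.
matches_per_season = matches_df.groupby("season")["id"].count()

# Toss decisions per season, indexed by (season, toss_decision).
toss_decision_counts = matches_df.groupby("season")["toss_decision"].value_counts()

# Divide each count by that season's total and scale to a percentage.
# level="season" tells pandas which index level to match on.
toss_decision_percentage = (
    toss_decision_counts.div(matches_per_season, level="season") * 100
)

print(toss_decision_percentage)
```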
Sort the values in descending order, then find the biggest 10 victories in the list. In this section, I will discuss the key results of my EDA. Mumbai Indians have played the most matches in the IPL. The two heavyweights, Mumbai and Chennai, have a head-to-head record in favour of Mumbai at 17-11. Check the description of the dataset: it usually gives details about how the data were collected, the time period to which the data belong, and so on, which helps in framing your questions for the exploratory data analysis. In both the series, I used the count() method on the winner column to find the matches won under the filtered conditions. Since an id is unique for each match (row), counting the number of ids for each season leads to what we want. I am a data science professional with over 10 years of experience, and I have authored 2 books on data science, which are available for sale here. To put emphasis on the top 10 victories, I used a different color and annotated those data points using plt.annotate(). Chennai and Mumbai are the teams with the most legacy. This is because two new franchises, the Pune Warriors and Kochi Tuskers Kerala, were introduced, increasing the number of teams to 10. A dataset contains many columns and rows. But combining deliveries.csv with this dataset could lead to more in-depth analysis. Since I needed matches played each season, it made sense to group our data according to different seasons. Analysis of Facebook data from Kaggle. But a better metric to judge would be the win percentage. Pandas provides helper functions to read data from various file formats like CSV, Excel spreadsheets, HTML tables, JSON and SQL, and to perform operations on them. I first accessed the result column using dot notation (matches_raw_df.result). Matplotlib and Seaborn are two Python libraries that are used to produce plots.
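The "sort in descending order, then keep the biggest victories" step could look like the sketch below. The data is made up, and I keep only the top 3 rows because the toy data frame is small; the article works with the top 10. Using sort_values() and head() here is my assumption about how the step was done.

```python
import pandas as pd

# Made-up win margins; only the column names follow the article.
matches_df = pd.DataFrame({
    "season": [2008, 2009, 2010, 2011, 2012],
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team B"],
    "win_by_runs": [12, 140, 0, 85, 33],
})

# Sort by win margin, largest first, and keep the top rows.
highest_wins_by_runs_df = (
    matches_df.sort_values(by="win_by_runs", ascending=False).head(3)
)

print(highest_wins_by_runs_df)
```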
Notice that the size was given as a tuple. A Kaggle dataset can contain multiple datasets, and if we define only the path, all of them will be downloaded. In this article, I am going to explain how to get started with Kaggle and make use of it to master your data science skills. The series used both season and toss_decision as an index. Did this decision transform the results? Mumbai Indians have won the IPL 4 times, the most. Exploratory analysis involves performing operations on the dataset to understand the data and find patterns. Without this command, sometimes plots may show up in pop-up windows. Comparing both training and test datasets, where column 0 is the training dataset and column 1 is the test dataset. I have used tools such as Pandas, Matplotlib and Seaborn along with Python to give a visual as well as numeric representation of the data in front of us. Using the read_csv() method from the Pandas library, I loaded the matches.csv file. When the Chennai Super Kings and Rajasthan Royals returned, these two teams were removed from the competition. Let's see what the trend has been amongst the teams across different seasons. So I decided to count the total number of different values for both the team1 and team2 columns using value_counts(). For 2008-2013, teams seemed to favour both batting first and second. I used the name matches_raw_df for the data frame. In the 2016 season, the Rising Pune Supergiants finished 7th. Get an idea of how complete a dataset is. You can perform more interesting analysis on matches.csv as a standalone data set. On the other hand, they chose fielding first more in 2008 and 2011. This could be because the IPL and T20 cricket in general were in their budding stages.
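Loading a CSV and counting how many matches each team played could be sketched as below. To keep the example self-contained it reads made-up CSV text from memory instead of the real matches.csv, and the `add(..., fill_value=0)` call is my own precaution for teams that appear in only one of the two columns.

```python
import io

import pandas as pd

# Made-up CSV text standing in for matches.csv; values are invented.
csv_text = """id,season,team1,team2,winner
1,2008,Team A,Team B,Team A
2,2008,Team B,Team C,Team C
3,2009,Team A,Team C,Team A
4,2009,Team C,Team A,Team C
"""

# read_csv() accepts a file path or any file-like object.
matches_raw_df = pd.read_csv(io.StringIO(csv_text))

# A team can appear as team1 or team2, so count both columns and add them.
total_matches_played = (
    matches_raw_df["team1"].value_counts()
    .add(matches_raw_df["team2"].value_counts(), fill_value=0)
    .astype(int)
)

print(total_matches_played)
```

With the real file, the call would simply take the path to matches.csv in place of the in-memory buffer.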
Mumbai have had the upper hand in the 2019 season every time they met, including the final. Kaggle is one of the world’s largest communities of data scientists and machine learning specialists. This could be down to the fact that the IPL and T20 cricket were both in their early stages, so teams were trying different strategies. The Chennai Super Kings and Rajasthan Royals could have been higher had they not been banned. Go watch it and enjoy! I then used the barplot() method from the Seaborn library to plot the series. Next I used the plot() method from Matplotlib to represent these values as bar charts. Notice the special command %matplotlib inline. This condition was stored as filter1. Filter the data frame using the required condition to find the matches played between the two teams. In the case of Kaggle, the problem statement and the dataset are provided to us upfront, whereas in reality the problem statement needs to be framed based on discussions with the stakeholders, and the data requirements are then identified from it. Then I added them together. One of the most significant events in any cricket match is the toss, which happens at the very start of a match. I did this data analysis and visualization as a project for the 6-week course Data Analysis with Python: Zero to Pandas. From analysis to exploratory data analysis, I experimented with a lot of ideas. Then I plotted the series ipl_winners using sns.barplot(). We will then use .head() to view the data. To find the win percentage, I divided most_wins by total_matches_played for each team. I chose to do my analysis on matches.csv. So, teams choosing to field more have been justified in their decisions.
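The win-percentage calculation described above can be sketched roughly as follows. The rows are made up and the variable names most_wins, total_matches_played and win_percentage follow the text; how the original notebook built each series is an assumption on my part.

```python
import pandas as pd

# Made-up matches; only the column names follow the article.
matches_df = pd.DataFrame({
    "team1":  ["Team A", "Team B", "Team A", "Team C", "Team A", "Team B"],
    "team2":  ["Team B", "Team C", "Team C", "Team A", "Team B", "Team A"],
    "winner": ["Team A", "Team C", "Team A", "Team C", "Team B", "Team A"],
})

# Wins per team.
most_wins = matches_df["winner"].value_counts()

# Matches played per team, counting appearances in either column.
total_matches_played = (
    matches_df["team1"].value_counts()
    .add(matches_df["team2"].value_counts(), fill_value=0)
)

# Division aligns the two series on team name.
win_percentage = (most_wins / total_matches_played * 100).sort_values(ascending=False)

print(win_percentage.round(2))
```

A team with few matches can top this ranking, which is why win percentage and raw win counts are worth reading together.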
Kaggle is a great platform. It provides a lot of exposure to the best-performing models, to techniques like cross-validation, and to other packages that can be used to improve the performance of a model. In reality, though, this modeling phase accounts for just 10-20% of a data science project, whereas a tremendous amount of effort goes into formulating the business problem, understanding the data requirements, identifying the data sources, transforming the data to meet the requirements, and feature engineering; only then come model building and deployment. Kaggle is a great place to learn and master data science skills, but it could easily become overwhelming if you don’t have strong knowledge of the basics. I assigned this cleaned data frame to matches_df. Now, teams may have a lot of history, but it's their "legacy", meaning how often they win, that makes them popular and attracts new and neutral fans. It’s time to learn data exploration from the best people. Having sharpened your data analysis skills, it’s now time to move into building predictive models and other data science solutions. Batting first requires that the team gauge the conditions and the pitch and then set a target accordingly. The data has been taken from Kaggle, with a copy of the raw data provided in the repository itself. Kaggle is the world’s largest data science community with powerful tools and resources to help you achieve your data science goals. I imported the libraries with different aliases such as pd, plt and sns.
For the above regression and classification examples, try to first understand the dataset, make use of the available exploratory data analysis notebooks to better understand the data, and then also try to learn about the model-building part; there should be at least a few notebooks with model implementations using a variety of algorithms. Having covered the regression and classification problems of supervised learning, the next step would be to explore a dataset related to an unsupervised learning problem. Please note that Kaggle recently announced an Open Data platform, so you may see many new datasets there in the coming months. For further ideas on analysis, check out the “Tasks” tab. This is a recently launched feature where people can add interesting things that can be done using the data, and others can submit their solutions. Exploratory Data Analysis (EDA) is a set of approaches which includes univariate, bivariate and multivariate visualization techniques, dimensionality reduction and cluster analysis. Visualization is the graphic representation of data. Below is a link to the housing dataset from Kaggle. So Mumbai has the most wins. The Mumbai Indians have played the most matches. In this interview, Sergey shares his insights from a prolific data science … It provides a unique opportunity for aspiring data scientists to learn from the world’s best for free. Let's see. Loading the dataset. They are followed by the Royal Challengers Bangalore, Kolkata Knight Riders, Kings XI Punjab and Chennai Super Kings. All three of them have had two seasons where they performed really well. To find the names of those columns I used the columns property. Some of the knowledge competitions to start with are below: the first one is good for learning about classification algorithms, and the second one is good for getting started with NLP. You can choose to download the CSV file here or start a new notebook on Kaggle.
Million Song Dataset from Columbia University, including data related to the song tracks and their artists/composers. It is typically used for working with tabular data (similar to the data stored in a spreadsheet). Look at trends and tendencies over time. Data from the file is read and stored in a DataFrame object, one of the core data structures in Pandas for storing and working with tabular data. For the datasets which you have been working on, go to the Notebooks tab and look for the analysis code snippets with a high number of upvotes and those that come from highly qualified users. Having learned from some of the experts, it’s now time to put those lessons to use. Especially Rising Pune Supergiant, which technically became a new team after dropping the 's'. I plotted the filtered data frame highest_wins_by_runs_df using sns.scatterplot(). Using the unstack() method on the series converted the values of toss_decision (that is, bat and field) into separate columns. The wins from batting first are very close to those from fielding first. It was only in late 2019 that I started actively contributing and writing notebooks on Kaggle. As mentioned above, I will be using the home prices dataset from Kaggle, the … We saw earlier that for 2008-2013, teams faced a conundrum over whether to bat first or field first. If you are new to data science, then begin with the dataset explorations. It makes sure that plots are shown and embedded within the Jupyter notebook itself. The toss winner can choose whether they want to bat first or second (fielding first). Exploratory data analysis (EDA) is the process of visualising and analysing data to extract insights. This gives us a new data frame which was stored as combined_wins_df.
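The unstack() step mentioned above might look like this on a small made-up example. Only the column names season and toss_decision come from the text; the variable name toss_decision_df is hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names follow the article.
matches_df = pd.DataFrame({
    "season": [2008, 2008, 2008, 2008, 2009, 2009],
    "toss_decision": ["bat", "field", "field", "field", "bat", "field"],
})

# A series with a two-level index: (season, toss_decision).
toss_decision_counts = matches_df.groupby("season")["toss_decision"].value_counts()

# unstack() pivots the innermost index level (bat/field) into columns,
# leaving season as the only index level.
toss_decision_df = toss_decision_counts.unstack()

print(toss_decision_df)
```

The resulting data frame has one row per season, which makes it straightforward to plot the two decisions next to each other.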
Here's a summary of what we learned through our analysis. In this article, we did a bunch of analysis and saw some interesting visualizations. This Kaggle competition is all about predicting the survival or death of a given passenger based on the features given. This machine learning model is built using the scikit-learn and fastai libraries (thanks to Jeremy Howard and Rachel Thomas). It uses an ensemble technique (the RandomForestClassifier algorithm). Here, toss_decision_percentage is a series with a multi-index. By itself this is pretty significant, as data gathering and cleaning is a huge part of the data science workflow. It helps us make sense of the data we have. For the x parameter I used season, and I used win_by_runs as the y parameter. Let's find out why. The Rising Pune Supergiant and Delhi Capitals have the highest win percentage. This gives information about columns, number of non-null values in each column, their data type, and memory usage. Using the shape property of a DataFrame object, I found that the dataset contains 756 rows and 18 columns. Each season, almost 60 matches were played. I used the _df suffix in the variable names for data frames. The Sunrisers Hyderabad are the only team that joined the league later and won the trophy. If interested, subscribe to my channel below. Kaggle is essentially a massive data science platform. We will mostly be using the pandas library for this task. In reality, mining the data is what makes all the difference between an okayish and a great model, not just analysis. So, out of 756 matches (rows), 4 matches ended as no result. Stick to one ideally or just a few if you have time.
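The first-look checks mentioned above (shape, info() and counting the values of the result column) could be sketched as follows. The data frame is a made-up four-row stand-in, so its shape and counts will not match the real dataset's 756 rows and 18 columns.

```python
import pandas as pd

# Made-up stand-in for the raw matches data; values are invented.
matches_raw_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "season": [2008, 2008, 2009, 2009],
    "result": ["normal", "normal", "tie", "no result"],
})

# shape is a (rows, columns) tuple.
print(matches_raw_df.shape)

# info() prints column names, non-null counts, dtypes and memory usage.
matches_raw_df.info()

# Count how many matches ended with each kind of result.
result_counts = matches_raw_df["result"].value_counts()
print(result_counts)
```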

