[DATA ANALYSIS] Basic data analysis combining Python, SPSS and PowerBI.
Last week I gave a presentation on a data analysis project, and in this post I would like to share the procedure with you. For the slides, please check this link.
Research goal:
Data Collection
Kaggle, everybody. Highly recommended! The link to the data set is here.
Data Extraction
I did a bit of data cleaning in Python, the tool I feel most at ease with. Since the data set is small, there is no need for big data tools here :) I chose which variables to clean based on the research questions mentioned above.
I went through this process in 6 steps:
1. Clean the Salary Estimate column: find all unique values and assign each one a number, or group them into min-max ranges if there are too many unique values. Treat -1 as NaN (luckily this column has no NaN).
2. Clean the company size column: assign each category a number, and merge -1 with Unknown.
3. Clean the Industry column: assign each unique industry a number and put these numbers into a new column named “Indsutry_new”.
4. Clean the Location column: separate it into 2 smaller columns, one with the state name only, named “state”, and one with the state code, named “state_new”.
5. Drop all unnecessary columns.
6. Write the result to a sav file and a csv file.
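To make steps 1 to 4 more concrete, here is a minimal sketch in plain Python. It is not the exact code from my project: the function names are made up for illustration, and the value formats ("$53K-$91K (Glassdoor est.)" for salary, "City, ST" for location) are assumptions about what the raw columns look like.

```python
import re


def parse_salary_range(text):
    """Return (min, max) in thousands, or None when the value is missing.

    Assumes strings shaped like "$53K-$91K (Glassdoor est.)" and "-1" for missing.
    """
    if text is None or text.strip() == "-1":
        return None
    numbers = re.findall(r"\$(\d+)K", text)
    if len(numbers) != 2:
        return None  # unexpected format, treat as missing
    return int(numbers[0]), int(numbers[1])


def encode_categories(values, merge=None):
    """Give each unique category an integer code, in order of first appearance.

    `merge` optionally maps one label onto another before coding,
    for example {"-1": "Unknown"}.
    """
    merge = merge or {}
    codes = {}
    encoded = []
    for value in values:
        value = merge.get(value, value)
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


def split_location(location):
    """Split "City, ST" into (city, state); state is None if there is no comma."""
    city, sep, state = location.rpartition(",")
    if not sep:
        return location.strip(), None
    return city.strip(), state.strip()


# Hypothetical sample values, only to show how the helpers fit together.
sizes, size_codes = encode_categories(
    ["1 to 50 employees", "-1", "Unknown"], merge={"-1": "Unknown"}
)
states = [split_location(loc)[1] for loc in ["New York, NY", "Austin, TX"]]
state_numbers, state_codes = encode_categories(states)
```

The same `encode_categories` helper covers the company size, industry and state columns, which keeps the numbering rule identical for all three. Keeping the returned `codes` dictionary is useful later, because SPSS and PowerBI only see the numbers and you will want to map them back to labels.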
Hypothesis testing
Four questions were asked, and each one was answered with a different test in SPSS. Before running any test, normality should be checked for the variables in question, and for the questions that use a correlation test, linearity should be checked first. Further details on these checks can be found in my previous posts.
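If you want a quick look at a variable's shape in Python before moving to SPSS, one rough screen is to compute skewness and excess kurtosis, both of which are close to 0 for normally distributed data. This is only a sketch and not the check I ran in SPSS: the function name is made up, and it uses the simple moment-based formulas, so the numbers will differ slightly from the bias-corrected values that SPSS reports.

```python
from statistics import fmean


def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from the central moments of the sample."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        raise ValueError("all values are identical, shape is undefined")
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical values: a symmetric sample has skewness 0.
skew, kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])
```

A clearly positive skewness points to a long right tail, which is common for salary data, and is a hint to look at the formal normality output in SPSS more carefully.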
Data Visualization
This part was done in PowerBI, using the csv file.
Data insight Presentation
Using Canva.
Voilà, there you have a whole data analysis process, using each tool for the job it does best!