5 tips you can use to get that data ready...
1) Find "holes" in your data:
-- missing data in a particular column
WHERE column_a IS NULL
-- patterns associated with holes
SELECT
column_b,
COUNT(*) AS records
FROM tablename
WHERE column_a IS NULL
GROUP BY column_b
ORDER BY COUNT(*) DESC;
2) Flag certain records (simple binary column):
CASE
WHEN column_a is NULL
OR column_b = 'some impossible value'
THEN 1
ELSE 0
END AS is_dirty
3) Clean up messy values (to standardize them):
CASE
WHEN location IN('CA','ca','San Diego, CA') THEN 'CA'
ELSE location
END AS clean_location
4) Bucket overly granular values into segments:
CASE
WHEN age < 18 THEN '1. < 18'
WHEN age BETWEEN 18 AND 35 THEN '2. 18-35'
WHEN age > 35 THEN '3. > 35'
END AS age_group
5) Analyze values across multiple records (powerful):
-- note the use of 3,2,1 to reorder the values
SELECT
subscriber_id,
MAX(
CASE
WHEN status = 'paying_subscriber' THEN '3. Active'
WHEN status = 'still_in_trial' THEN '2. Trial'
WHEN status = 'canceled' THEN '1. Canceled'
END) AS current_status
FROM subscription_table
GROUP BY subscriber_id
What else do you do to clean your data?
Database Used:
1) Basics
-
Top 5 rows
-
WHERE , AND, OR, NOT, IN
-
ORDER BY, ASC, DESC
-
IS NULL
-
LIMIT
-
Last 3 records
-
MIN, MAX, COUNT, AVG, SUM
-
LIKE, WILDCARDS
-
IN BETWEEN
-
UNION
-
LEFT, RIGHT, MID, CONCAT
-
GROUP BY
-
FIRST_VALUE, LAST_VALUE
-
CASE WHEN
-
UNION
-
Count the number of non-null entries in a column
-
When to use group by vs. distinct
-
1st and last record from the table
-
Count of emp working in same dept
-
String formatting, substring
-
Subqueries
-
HAVING vs WHERE
-
MAX and MIN
-
Difference between HAVING and WHERE
-
When are COUNT(*) and COUNT DISTINCT equal
-
Can left and inner join return the same results?
-
What is the difference between RANK and DENSE RANK
-
Can UNION and UNION ALL return the same results?
-
What are the best scenarios to use a self join?
-
How can you use SQL to dedupe rows?
-
Why would you use GROUP BY ROLLUP?
-
What happens if I GROUP BY a column that is not in the SELECT statement? Why does this
happen?
-
For dealing with NULL values, why would I choose to use IFNULL vs. CASE WHEN?
-
Do temp tables make your code cleaner and faster, one of the two, or none? Why?
-
When is a subquery a bad idea? A good idea?
-
Why use the WITH statement? What does it help you do?
-
Can you use COUNT DISTINCT on two columns? How?
-
When does a full outer join actually make sense to use?
-
Can UNION and UNION ALL return the same results?
-
When is it a bad idea to use a subquery?
-
What is the best approach to doing a running total?
-
Inner joining tables A and B returns 50 rows. Left join B to A returns 125 rows. What's going
on here?
-
User Defined Functions
-
COALESCE and CASE WHEN
References Used:
- https://www.linkedin.com/posts/eric-weber-060397b7_datascience-data-sql-activity-6787722309848350721-PlIH
- https://www.youtube.com/watch?v=UlXtfq-kuF4
- https://www.linkedin.com/posts/srishti-srivastava-aaa04113a_datascience-sql-data-activity-6782611503947268097-ftdA
- https://www.linkedin.com/posts/eric-weber-060397b7_data-datascience-analytics-activity-6794261762872020992-0Eha
- https://www.linkedin.com/posts/eric-weber-060397b7_data-datascience-sql-activity-6780300222632407040-kwSo
- https://www.linkedin.com/posts/eric-weber-060397b7_sql-data-datascience-activity-6808737477549969408-TJSi
- https://www.linkedin.com/posts/eric-weber-060397b7_sql-datascience-machinelearning-activity-6808702703619903488-T2yA
- https://www.linkedin.com/posts/datawithdanny_data-analytics-datascience-activity-6805675440045469696-NSqo
- https://www.linkedin.com/posts/eric-weber-060397b7_data-datascience-analytics-activity-6804394683897298944-oLY4
- https://www.linkedin.com/posts/eric-weber-060397b7_data-datascience-sql-activity-6715986430864224256-xDv3
- https://www.linkedin.com/posts/eric-weber-060397b7_sql-datascience-data-activity-6707148760071372801-zAKk
- https://www.linkedin.com/posts/akashraj001_my-checklist-before-going-for-an-sql-round-activity-6794270380061188096-ATmc
- https://www.linkedin.com/posts/eric-weber-060397b7_sql-datascience-statistics-activity-6808368707434291200-Uf70
- https://www.linkedin.com/posts/johnpauler_mavenquicktips-sql-data-activity-6807275059107942400-GRt_