About my datacamp learning process
keep lectures, notebooks, progress, ... and git structure
I started learning with Datacamp in March 2019. It is a great resource, and I recommend that all data science newcomers give it a shot.
What I like is the consistency of the course content: there is an overall logic across all courses. And the catalogue is incredible: more than 300 interactive courses. You may not find all of them useful, but you can pick whatever interests you. With my learning process, a course takes me about 8 hours to complete.
Career tracks are a smart way to take a first tour of data science. I followed the python programmer (old version), data scientist with python (old version) and machine learning scientist with python tracks. Mileage may vary, but each track is about 20 courses. Updated versions of the tracks are now online; they mix courses, projects and skills assessments. I have tested one project, but it was a little too basic for me.
There is a nice, smooth progress tracking system, and as in a game you earn XP for each achievement.
A natural way to select courses is to browse the courses in career tracks, and I plan to complete the courses from the new versions of those tracks. When I need to learn a new domain, I just search for relevant courses (the search engine is very good).
I have 2 ways to track these courses:
- bookmarks in Datacamp
- entries in my ITP (individual training plan, a big Excel list of learning items I plan to follow)
As an example I will use Data Manipulation with pandas, which is part of the new Data Scientist career track and is in my ITP:
In my data-scientist-skills github repo, I have 2 folders:
- Other datacamp courses: where I keep lectures (pdf slides) from datacamp courses
- python-sandbox: where I keep notebooks and data from datacamp exercises
Initialising a new course takes three steps:
- creation of the Data Manipulation with pandas folder under Other datacamp courses
- creation of the data-manipulation-with-pandas folder under python-sandbox
- copy of python-sandbox/_1project-template/ into python-sandbox/data-manipulation-with-pandas
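The three steps above could be scripted. The following is only a minimal sketch: the helper names `slugify` and `init_course` are hypothetical, the folder names are the ones listed above, and the demo runs in a temporary directory with a placeholder template rather than in the real repo.

```python
import shutil
import tempfile
from pathlib import Path


def slugify(course_title: str) -> str:
    """Turn 'Data Manipulation with pandas' into 'data-manipulation-with-pandas'."""
    return "-".join(course_title.lower().split())


def init_course(repo_root: Path, course_title: str) -> Path:
    """Create the lecture folder and copy the project template for one course."""
    lectures = repo_root / "Other datacamp courses" / course_title
    lectures.mkdir(parents=True, exist_ok=True)

    template = repo_root / "python-sandbox" / "_1project-template"
    sandbox = repo_root / "python-sandbox" / slugify(course_title)
    # copytree creates the destination folder; dirs_exist_ok needs Python 3.8+
    shutil.copytree(template, sandbox, dirs_exist_ok=True)
    return sandbox


# Demo in a throwaway directory so nothing touches a real repo.
demo_root = Path(tempfile.mkdtemp())
template_dir = demo_root / "python-sandbox" / "_1project-template"
(template_dir / "data_from_datacamp").mkdir(parents=True)
(template_dir / "start_env.sh").write_text("# placeholder\n")

created = init_course(demo_root, "Data Manipulation with pandas")
print(created.name)
```

Deriving the sandbox folder name from the course title keeps the two folders in sync, which is the convention the steps above follow by hand.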
In this project template,
- data_from_datacamp will store all data needed to launch datacamp exercises
- exports_py will contain exports of notebooks in txt/py format (useful to search on code patterns)
- start_env.sh and start_env.bat launch jupyter notebook from the right conda env
- downloadfromFileIO.py downloads data files from within my local notebooks (using file.io in the background)
- uploadfromdatacamp.py uploads data files from datacamp
- uploadfromdatacamp_examples.py has some examples to transfer dataframes, dataseries, lists, ...
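How the exports in exports_py are produced is not shown here. As an illustration only, the sketch below writes the code cells of a notebook to a .py file using nothing but the standard library; it assumes the usual nbformat 4 JSON layout (a `cells` list whose entries have `cell_type` and `source`), and `export_code_cells` is a hypothetical name, not a function from the template.

```python
import json
import tempfile
from pathlib import Path


def export_code_cells(notebook_path: Path, export_dir: Path) -> Path:
    """Write the code cells of a .ipynb file to export_dir/<notebook name>.py."""
    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat 4 stores source either as one string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    export_dir.mkdir(parents=True, exist_ok=True)
    target = export_dir / (notebook_path.stem + ".py")
    target.write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
    return target


# Demo with a tiny hand-written notebook (made-up content).
work_dir = Path(tempfile.mkdtemp())
demo_notebook = work_dir / "chapter1 - Transforming Data.ipynb"
demo_notebook.write_text(json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["import pandas as pd\n", "x = 1"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": "print(x)"},
    ],
}), encoding="utf-8")

exported = export_code_cells(demo_notebook, work_dir / "exports_py")
print(exported.read_text(encoding="utf-8"))
```

Plain-text exports like this are what make it practical to search for code patterns across notebooks with ordinary text tools.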
After initialisation, I have the following structure and content:
On the left are the lectures (one per chapter) and the final certificate; on the right, the notebooks.
Just run the jupyter notebook environment by calling start_env.sh.
Get the chapter title:
And name the notebook accordingly:
Then enter the interactive instructions. I copy and paste them using the Copy Selection as Markdown Firefox add-on.
In this example, to follow the instructions locally I need the homelessness dataframe.
I can use the following code from uploadfromdatacamp_examples.py:
##### Dataframe
###################
#upload and download
from downloadfromFileIO import saveFromFileIO
""" to run on datacamp (after copying the code from uploadfromdatacamp.py):
uploadToFileIO(homelessness)
"""
tobedownloaded="""
{pandas.core.frame.DataFrame: {'homelessness.csv': 'https://file.io/vTM1t2ehXds4'}}
"""
prefixToc='1.1'
prefix = saveFromFileIO(tobedownloaded, prefixToc=prefixToc, proxy="")
#initialisation
import pandas as pd
homelessness = pd.read_csv(prefix+'homelessness.csv',index_col=0)
Before executing this cell, I have to copy, paste and execute the content of uploadfromdatacamp.py on the datacamp server, then call uploadToFileIO(homelessness) and take the last line of the result:
In [2]:
uploadToFileIO(homelessness)
{"success":true,"key":"vTM1t2ehXds4","link":"https://file.io/vTM1t2ehXds4","expiry":"14 days"}
{pandas.core.frame.DataFrame: {'homelessness.csv': 'https://file.io/vTM1t2ehXds4'}}
I then copy that last line into the tobedownloaded variable.
Update prefixToc to the right value (exercise 1.1 is the first one in the first chapter); it is used as a prefix for the data file names. Also update the local variable name and the csv file name.
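The line copied into tobedownloaded is not a valid Python literal, because the type name is not quoted. The internals of saveFromFileIO are not shown in this post, so the sketch below is only one possible way to read that line and build the file prefix. Both `parse_download_spec` and `file_prefix` are hypothetical helpers, and the naming pattern is an assumption inferred from the file names that appear in the git commit output of this post.

```python
import ast
import re


def parse_download_spec(spec: str) -> dict:
    """Turn the line printed on datacamp into a regular dictionary."""
    # The type name (pandas.core.frame.DataFrame) is not quoted, so the text is
    # not a valid Python literal yet: quote any dotted name followed by ': {'.
    quoted = re.sub(r"([A-Za-z_][\w.]*)\s*:\s*\{", r"'\1': {", spec.strip())
    return ast.literal_eval(quoted)


def file_prefix(notebook_name: str, prefix_toc: str) -> str:
    """Hypothetical helper reproducing the file names seen in the commit log."""
    return f"data_from_datacamp/{notebook_name}-Exercise{prefix_toc}_"


tobedownloaded = """
{pandas.core.frame.DataFrame: {'homelessness.csv': 'https://file.io/vTM1t2ehXds4'}}
"""
downloads = parse_download_spec(tobedownloaded)
prefix = file_prefix("chapter1 - Transforming Data", "1.1")
for type_name, files in downloads.items():
    for file_name, url in files.items():
        print(type_name, "->", prefix + file_name, "from", url)
```

Using `ast.literal_eval` rather than `eval` means the pasted text can only ever produce plain data, never run code.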
Run the cell. Here is the result:
Téléchargements à lancer
{'pandas.core.frame.DataFrame': {'homelessness.csv': 'https://file.io/vTM1t2ehXds4'}}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2528    0  2528    0     0   4870      0 --:--:-- --:--:-- --:--:--  4870
The homelessness dataframe is now available.
Downloaded files are in the data_from_datacamp folder.
Running the cell again won't download the file from file.io but will read the cached file (delete the file to force a new download).
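The caching behaviour described above can be sketched as a download-only-if-missing function. This is an illustration, not the code of downloadfromFileIO.py: the name `cached_download` is hypothetical, and `urllib` is swapped in for whatever the real script uses (its output suggests curl). The demo uses a fake fetch function and made-up csv content so that it runs without network access.

```python
import tempfile
import urllib.request
from pathlib import Path
from typing import Callable


def fetch_with_urllib(url: str) -> bytes:
    """Default fetcher: read the whole response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url: str, target: Path,
                    fetch: Callable[[str], bytes] = fetch_with_urllib) -> bool:
    """Download url to target unless target already exists.

    Returns True when a download happened, False when the cached file was kept.
    """
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(url))
    return True


# Demo with a fake fetcher that records each call.
calls = []


def fake_fetch(url: str) -> bytes:
    calls.append(url)
    return b"state,individual\nAlabama,1\n"  # made-up content


demo_dir = Path(tempfile.mkdtemp())
target = demo_dir / "data_from_datacamp" / "Exercise1.1_homelessness.csv"
url = "https://file.io/vTM1t2ehXds4"

first = cached_download(url, target, fake_fetch)   # file missing: downloads
second = cached_download(url, target, fake_fetch)  # file cached: no download
target.unlink()                                    # delete to force a download
third = cached_download(url, target, fake_fetch)
print(first, second, third, len(calls))
```

Caching matters here because the upload output reports an expiry ("14 days") for the link, so the local copy is the one to rely on afterwards.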
The full content of this notebook example is at the bottom of this post.
~/git/guillaume/data-scientist-skills$ git add .
~/git/guillaume/data-scientist-skills$ git commit -m 'start of data manipulation in pandas course'
[master c8696ce] start of data manipulation in pandas course
45 files changed, 9010 insertions(+)
create mode 100644 Other datacamp courses/Data Manipulation with pandas/chapter1.pdf
create mode 100644 python-sandbox/data-manipulation-with-pandas/.ipynb_checkpoints/chapter1 - Transforming Data-checkpoint.ipynb
create mode 100644 python-sandbox/data-manipulation-with-pandas/__pycache__/downloadfromFileIO.cpython-37.pyc
create mode 100644 python-sandbox/data-manipulation-with-pandas/chapter1 - Transforming Data.ipynb
create mode 100644 python-sandbox/data-manipulation-with-pandas/data_from_datacamp/.empty_dir.txt
create mode 100644 python-sandbox/data-manipulation-with-pandas/data_from_datacamp/chapter1 - Transforming Data-Exercise1.1_3277903540843719836.lock
create mode 100644 python-sandbox/data-manipulation-with-pandas/data_from_datacamp/chapter1 - Transforming Data-Exercise1.1_homelessness.csv
create mode 100644 python-sandbox/data-manipulation-with-pandas/downloadfromFileIO.py
create mode 100644 python-sandbox/data-manipulation-with-pandas/exports_py/.empty_dir.txt
create mode 100644 python-sandbox/data-manipulation-with-pandas/exports_py/Untitled.py
create mode 100644 python-sandbox/data-manipulation-with-pandas/exports_py/Untitled.txt
create mode 100644 python-sandbox/data-manipulation-with-pandas/exports_py/chapter1 - Transforming Data.py
create mode 100644 python-sandbox/data-manipulation-with-pandas/start_env.bat
create mode 100755 python-sandbox/data-manipulation-with-pandas/start_env.sh
create mode 100644 python-sandbox/data-manipulation-with-pandas/uploadfromdatacamp.py
create mode 100644 python-sandbox/data-manipulation-with-pandas/uploadfromdatacamp_examples.py
~/git/guillaume/data-scientist-skills$ git push
Enumerating objects: 43, done.
Counting objects: 100% (43/43), done.
Delta compression using up to 12 threads
Compressing objects: 100% (38/38), done.
Writing objects: 100% (40/40), 5.75 MiB | 3.85 MiB/s, done.
Total 40 (delta 8), reused 1 (delta 0)
remote: Resolving deltas: 100% (8/8), completed with 3 local objects.
To github.com:castorfou/data-scientist-skills.git
89f60e5..c8696ce master -> master
Datacamp shows progress instantly.
So I regularly report this progress (here 0.18/4 = 4.5%, rounded to 5%) in my ITP.
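The arithmetic behind that figure can be written out. This small sketch assumes that 0.18 is the completed fraction of the current chapter and that 4 is the number of chapters in the course; `course_progress` is a hypothetical name.

```python
def course_progress(chapter_fraction_done: float, chapters_in_course: int) -> float:
    """Overall course progress when only one chapter has been started."""
    return chapter_fraction_done / chapters_in_course


progress = course_progress(0.18, 4)
print(f"{progress:.1%}")
```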
I download the certificates and keep them with the lectures.
Inspecting a DataFrame
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.
- .head() returns the first few rows (the “head” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.
homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individual column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.
pandas is imported for you.
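To try these four tools without downloading anything, here is a self-contained sketch on a toy DataFrame. The values are made up for illustration, and the `state` column is an assumption; only individual, family_members and state_pop are named in the exercise text.

```python
import pandas as pd

# Toy data: invented values, not the real homelessness estimates.
toy = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individual": [100.0, 200.0, 300.0],
        "family_members": [10.0, 20.0, 30.0],
        "state_pop": [1000, 2000, 3000],
    }
)

print(toy.head(2))     # first two rows
toy.info()             # column types and non-null counts, printed directly
print(toy.shape)       # (number of rows, number of columns)
print(toy.describe())  # summary statistics for the numeric columns
```

Note that `.info()` prints its report itself and returns `None`, so wrapping it in `print()` adds a trailing `None` line to the output.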
##### Dataframe
###################
#upload and download
from downloadfromFileIO import saveFromFileIO
""" to run on datacamp (after copying the code from uploadfromdatacamp.py):
uploadToFileIO(homelessness)
"""
tobedownloaded="""
{pandas.core.frame.DataFrame: {'homelessness.csv': 'https://file.io/vTM1t2ehXds4'}}
"""
prefixToc='1.1'
prefix = saveFromFileIO(tobedownloaded, prefixToc=prefixToc, proxy="")
#initialisation
import pandas as pd
homelessness = pd.read_csv(prefix+'homelessness.csv',index_col=0)
Print the head of the homelessness DataFrame.
print(homelessness.head())
Print information about the column types and missing values in homelessness.
print(homelessness.info())
Print the number of rows and columns in homelessness.
print(homelessness.shape)
Print some summary statistics that describe the homelessness DataFrame.
print(homelessness.describe())