*This is my first time learning Python. I mostly use R in my projects, and as far as I know, whatever can be done in Python can also be done in R. But as an aspiring data scientist, I shouldn't stick with R all the time. I enrolled in this course by Microsoft because it is data-science-centered, though it suits anyone interested in Python, not only data scientists. Join me as I write down my takeaways here.*

Version 3.x – **https://www.python.org/downloads/**

Python Script – Text Files **.py**

Basics

**print(3 + 4)** # add

**print(4 - 3)** # subtract

**print(4 * 3)** # multiply

**print(4 / 2)** # divide

**print(4 ** 2)** # exponent, 4²

**print(4 % 2)** # modulo

Variables

**height = 1.79**

**weight = 68.7**

**bmi = weight / height ** 2**

Types

**type(bmi)** # float

**type(5)** # int

**type("body mass index")** # str

**type('this works too')** # str

**type(True)** # bool

**print(2 + 3)** # 5

**print('ab' + 'cd')** # 'abcd'

**"I said " + ("Hey " * 2) + "Hey!"** # 'I said Hey Hey Hey!'

**str(5)** # convert 5 to the string "5"

**int(True)** # convert True to 1

**bool("True")** # True; any non-empty string converts to True, even "False"

**float(1)** # convert 1 to 1.0
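Here is a small sketch of why these conversions matter in practice: `+` will not mix a string with a number, so the number has to go through `str()` first. The names `savings`, `message` and `pi_string` are made up for illustration.

```python
# Hypothetical values, only for illustration
savings = 100

# "text" + 100 would raise a TypeError, so convert the number first
message = "I started with $" + str(savings) + " and want more."
print(message)

pi_string = "3.1415926"
pi_float = float(pi_string)      # string -> float
print(type(pi_float))

# Any non-empty string is truthy, whatever it says; the empty string is falsy
print(bool("False"), bool(""))
```

The last line is the one that surprised me: `bool()` checks whether the string is empty, not what it spells.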

Lists

**fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"]]** # can contain different types, even other lists

**type(fam)** # list

**fam[3]** # 1.68, zero-based indexing

**fam[-1]** # ["c", "d"], negative indexes count from the end

**fam[-3]** # 1.89

**fam[3:5]** # [1.68, "mom"] [start:end], start inclusive, end exclusive

**fam[:4]** # 0 to 3: ["liz", 1.73, "emma", 1.68]

**fam[5:]** # 5 to last: [1.71, "dad", 1.89, ["a", "b"], ["c", "d"]]

**fam[0:2] = ["lisa", 1.74]** # fam = ["lisa", 1.74, "emma", 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"]]

**fam = fam + ["me", 1.79]** # ["lisa", 1.74, "emma", 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"], "me", 1.79]; without the re-assignment, + returns a new list and fam stays unchanged

**del(fam[2])** # removes "emma": ["lisa", 1.74, 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"], "me", 1.79]

**x = ["a", "b", "c"]**

**y = x**

**y[1] = "z"** # x[1] is also "z" because y = x copied the reference to the list, not the values themselves

**y = list(x)** # or **y = x[:]**; both create a new list with the same elements, so changes to y no longer affect x
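To convince myself of the reference-versus-copy difference, I put both cases side by side. This is only a minimal sketch; the names `alias` and `copied` are mine.

```python
x = ["a", "b", "c"]
alias = x            # second name for the very same list
copied = list(x)     # new list holding the same elements

alias[1] = "z"       # also changes x, since alias and x are one object
copied[2] = "y"      # leaves x alone

print(x)             # ['a', 'z', 'c']
print(copied)        # ['a', 'b', 'y']
print(alias is x, copied is x)   # True False
```

Note that `list(x)` and `x[:]` make a shallow copy: if the list contains other lists, those inner lists are still shared.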

**fam.index("mom")** # finds "mom" and returns its index: 3 here, because "emma" was deleted above

**fam.count(1.74)** # counts the number of times 1.74 occurs in the list; returns 1

**first = [11.25, 18.0, 20.0]**

**second = [10.75, 9.50]**

**full = first + second** # paste together

**full_sorted = sorted(full, reverse=True)** # sort in descending order

Functions

**max(full)** # maximum value in the list, 20.0; max(fam) raises a TypeError in Python 3 because fam mixes strings and numbers

**round(1.68, 1)** # round 1.68 to 1 decimal place, 1.7

**round(1.68)** # round to nearest whole number

**help(round)** # opens documentation of round function

**len(fam)** # length of list

Methods

*Methods are functions that belong to an object. You call them on the object itself with dot notation, and which methods are available depends on the object's type.*

**sister = 'liz'**

**sister.capitalize()** # 'Liz'

**sister.replace("z", "sa")** # 'lisa'

**sister.index("z")** # 2

**fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]**

**fam.index("mom")** # 4

**fam.append("me")** # fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89, "me"]; fam is updated in place, no re-assignment needed

**sister.upper()** # 'LIZ'; returns a new string, sister itself is unchanged

**sister.count("i")** # 1

**fam.reverse()** # fam = ["me", 1.89, "dad", 1.71, "mom", 1.68, "emma", 1.73, "liz"]; again updated in place
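One thing that tripped me up: some methods hand back a new value, others change the object and return nothing. A short sketch of the difference, with `shouted` and `result` as made-up names:

```python
sister = "liz"
shouted = sister.upper()       # strings are immutable, so a new string comes back
print(sister, shouted)         # liz LIZ

fam = ["liz", 1.73, "emma", 1.68]
result = fam.append("me")      # lists are mutable, so fam is changed in place
print(result)                  # None
print(fam)
```

So writing `fam = fam.append("me")` would be a mistake, since it replaces the list with `None`.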

Numpy

*NumPy (Numerical Python) works efficiently with arrays. Once installed…*

**import numpy as np** # np is the conventional alias; a plain *import numpy* also works, but then every call needs the full name, e.g. *numpy.array*

**np.array([1, 2, 3])**

**a = [1, 2, 3]**

**b = [4, 5, 6]**

**np_a = np.array(a)**

**np_b = np.array(b)**

**np_a / np_b ** 2** # can perform element-wise operations

**np.array([1.0, "is", True])** # every element becomes a string because a NumPy array holds only one type

**python_list = [1, 2, 3]**

**python_list + python_list** # [1, 2, 3, 1, 2, 3]

**numpy_array = np.array([1, 2, 3])**

**numpy_array + numpy_array** # array([2, 4, 6])

**np_a[1]** # 2

**np_a > 1** # array([False, True, True]); this needs the NumPy array, comparing the plain list a with 1 raises a TypeError

**np_a[np_a > 1]** # array([2, 3])
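Putting element-wise arithmetic and boolean indexing together, here is a sketch that redoes the BMI calculation from the Variables section for several people at once. The heights and weights are made-up numbers, and `light` is just my name for the mask.

```python
import numpy as np

# Made-up heights (m) and weights (kg), for illustration only
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

bmi = np_weight / np_height ** 2   # element-wise, no loop needed
light = bmi < 22                   # boolean array, one entry per person

print(bmi.round(2))
print(bmi[light])                  # keeps only the entries where light is True
```

With plain Python lists the same division would fail, because lists do not support element-wise `/`.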

**np_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])** # 2D array

**np_2d.shape** # returns the dimensions of the array; (2, 4) since 2 rows and 4 columns

**np_2d[0]** # array([1, 2, 3, 4])

**np_2d[0][1]** # 2

**np_2d[0, 1]** # 2

**np_2d[:, 1:3]** # all rows, columns 1 and 2: array([[2, 3], [6, 7]])

**np_2d[1, :]** # array([5, 6, 7, 8])

**np_2d_another = np.array([[1, 1, 1, 1], [1, 1, 1, 1]])**

**np_2d + np_2d_another** # array([[2, 3, 4, 5], [6, 7, 8, 9]])

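**np.mean(np_2d)** # mean

**np.median(np_2d)** # median

**np.corrcoef(np_2d[0], np_2d[1])** # correlation matrix of the two rows; a constant array such as np_2d_another would give nan

**np.std(np_2d)** # standard deviation

**np.sum(np_a)** # sum; usually faster than the built-in sum() on NumPy arrays

**np.sort(np_a)** # sort; usually faster than the built-in sorted() on NumPy arrays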

**height = np.round(np.random.normal(1.75, 0.20, 5000), 2)** # 1.75 distribution mean, 0.20 distribution standard deviation, 5000 samples

**weight = np.round(np.random.normal(60.32, 15, 5000), 2)**

**np_city = np.column_stack((height, weight))** # combine height and weight as columns; the arrays are passed together as one tuple
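A quick sketch to check what the stacked array looks like. I use a seeded generator (`np.random.default_rng(0)`) instead of `np.random.normal` so the run is repeatable; that swap is my own assumption, not something from the course.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator, assumed here for reproducibility
height = np.round(rng.normal(1.75, 0.20, 5000), 2)
weight = np.round(rng.normal(60.32, 15, 5000), 2)

# column_stack takes ONE argument: a tuple (or list) of arrays
np_city = np.column_stack((height, weight))

print(np_city.shape)             # (5000, 2)
print(np.mean(np_city[:, 0]))    # mean of the height column, close to 1.75
print(np.mean(np_city[:, 1]))    # mean of the weight column, close to 60.32
```

Each row is one person, with height in column 0 and weight in column 1.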

**gk_heights = np_heights[np_positions == 'GK']** # select from one array using a boolean mask built from another array of the same length
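The line above only works if `np_heights` and `np_positions` already exist, so here is a self-contained sketch with a handful of hypothetical players. The position codes and heights (in cm) are invented.

```python
import numpy as np

# Hypothetical data: one position code and one height (cm) per player
np_positions = np.array(["GK", "M", "A", "D", "GK", "M"])
np_heights = np.array([191, 184, 185, 180, 188, 176])

# The comparison yields a boolean array that is then used as an index
gk_heights = np_heights[np_positions == "GK"]
other_heights = np_heights[np_positions != "GK"]

print(np.median(gk_heights))      # 189.5
print(np.median(other_heights))   # 182.0
```

The two arrays must line up element by element, since the mask from one picks positions in the other.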

Matplotlib

*Package usually used for data visualization*

**import matplotlib.pyplot as plt** # import matplotlib's pyplot module with plt as alias

**year = [1950, 1970, 1990, 2010]**

**pop = [2.519, 3.692, 5.263, 6.972]**

**plt.plot(year, pop)** # (horizontal, vertical) line plot

**pop = [1.0, 1.262, 1.650] + pop** # include these values too

**year = [1800, 1850, 1900] + year** # include the 3 years

**plt.fill_between(year, pop, 0, color='green')** # fill the area under the line with green

**plt.xlabel('Year')** # x axis label

**plt.ylabel('Population')** # y axis label

**plt.title('World Population Projections')** # title label

**plt.yticks([0, 2, 4, 6, 8, 10])** # all the ticks you want to display on the y axis

**plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B'])** # the 2nd argument gives the tick labels

**plt.show()** # displays the plot; nothing appears until this is called

**plt.scatter(year, pop)** # scatter plot

**plt.xscale('log')** # put the x axis on a logarithmic scale

**help(plt.hist)** # help for the function hist in module matplotlib.pyplot

**plt.hist(pop, bins = 3)** # histogram of pop with 3 bins

**plt.clf()** # clear the current figure

**plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8)** # another scatter plot example; s sets the marker sizes, c the colors, alpha the opacity

Boolean Logic and Control Flow

**x = 12**

**x > 5 and x < 15** # True

**y = 5**

**y <= 7 or y > 13** # True

**z = 4**

**z % 2 == 0** # z modulo 2 is 0 or z is divisible by 2
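These boolean expressions become useful once they drive `if`, `elif` and `else`. A minimal sketch; the labels and the `evens` list are my own examples.

```python
x = 12

# Only the first branch whose condition is True runs
if x > 5 and x < 15:
    label = "medium"
elif x >= 15:
    label = "large"
else:
    label = "small"
print(label)             # medium

# The same modulo test as above, applied to every number from 1 to 8
evens = []
for n in range(1, 9):
    if n % 2 == 0:
        evens.append(n)
print(evens)             # [2, 4, 6, 8]
```

Indentation is what marks the body of each branch, so it has to be consistent.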

Pandas

*Unlike NumPy arrays, pandas DataFrames can hold columns of different types.*

**import pandas as pd**

**brics = pd.read_csv("brics.csv")** # load csv file

**brics = pd.read_csv("brics.csv", index_col=0)** # use the first column as the row index

**brics["country"]** # returns the column country

**brics.country** # returns the column country too

**brics["on_earth"] = [True, True, True, True, True]** # adding a column

**brics["density"] = brics["population"] / brics["area"] * 10000000**

**brics.loc["BR"]** # row access by index label

**brics.loc["CH", "capital"]** # element access; row, column

**brics["capital"].loc["CH"]** # element access; column, row

**brics.loc["CH"]["capital"]** # element access; row, column

**brics.loc["BR"]** # returns a Series

**brics.loc[["BR"]]** # returns a DataFrame
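If you don't have `brics.csv` at hand, the same access patterns can be tried on a DataFrame built from a dictionary. This is only a stand-in I put together: the column names follow the notes above, while the area and population figures are approximate and just for illustration.

```python
import pandas as pd

# Illustrative stand-in for brics.csv; the numbers are approximate
data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],        # million km^2
    "population": [200.4, 143.5, 1252, 1357, 52.98],    # millions
}
brics = pd.DataFrame(data, index=["BR", "RU", "IN", "CH", "SA"])

print(type(brics.loc["BR"]))         # single label -> Series
print(type(brics.loc[["BR"]]))       # list of labels -> DataFrame
print(brics.loc["CH", "capital"])    # Beijing
```

The single versus double brackets are easy to mix up: a list of labels always gives back a DataFrame, even when the list has one entry.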

Scikit-learn

*Package for machine learning*

**from sklearn import datasets** # datasets is a submodule of sklearn

**iris = datasets.load_iris()** # call the function that loads the iris dataset

**iris.data** # features

**iris.target** # response variable/s