python:first_course_statistics

Differences

Below are the differences between two revisions of the page.

python:first_course_statistics [2016/10/08 09:11]
Francesco Beretta [Eruptions of the Old Faithful geyser (p.5)]
python:first_course_statistics [2016/10/28 14:36]
Beretta, Anna Letizia
Line 1: Line 1:
-====== Lessons ====== 
  
 +====== General instructions ======
  
 +Read the following important documentation about:
 +  * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]]
 +  * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]]
  
-===== Eruptions of the "Old Faithful" geyser (p.5) =====
 +Save your scripts in a folder inside the data folder, naming the script folder 'my_scripts' or whatever you like. If 'my_scripts' is set as your [[python:generic_features#get_the_current_working_directory_address|current working directory]], the data files are available at the relative address '../[data file]', for instance: '../geyser1.TAB'
 +\\ 
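The relative address described above can be checked with a minimal sketch like the following. The helper name `data_path` is hypothetical (it is not part of the original page), and the sketch assumes the script folder sits directly inside the data folder.

```python
from pathlib import Path

def data_path(filename, script_dir=None):
    """Return the path of a data file stored one level above the script folder.

    Hypothetical helper: script_dir defaults to the current working directory.
    """
    base = Path(script_dir) if script_dir is not None else Path.cwd()
    return base.parent / filename

# Show where Python is currently working and where it would look for the file
print(Path.cwd())
print(data_path('geyser1.TAB'))
```

If the printed path does not point at the data folder, the working directory is probably not the script folder.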
 +====== Eruptions of the "Old Faithful" geyser (p.5) ======
  
 +\\
  
-==== Histogram (p.5) ====
 +===== Histogram (p.5) =====
  
 <code python>
-# fake code – to be deleted
-import csv
-filename = 'ch02-data.csv'
-f = open(filename)
-data = []
-reader = csv.reader(f)
-header = reader.next()
-data = [row for row in reader]
-for datarow in data:
-    print datarow
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +# Load the tab-separated geyser data (path relative to the script folder)
 +gys1 = pd.read_csv('../geyser1.TAB', sep='\t')
 +g_int = gys1['Interval']
 +ax = plt.gca()
 +ax.hist(g_int, bins=20, color='r')
 +ax.set_xlabel('Intereruption time')
 +ax.set_ylabel('Frequency')
 +ax.set_title('Histogram')
 +plt.show()
 </code>
  
  
-===== International adoption rates (p.13) =====
 +\\
 + 
 +===== Boxplot (p. 6) =====
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# '...' is left as in the original; replace it with the path to the data folder
 +gysr1_boxplot = pd.read_csv(r'...\geyser1.TAB', sep='\t')
 +data_gysr1 = gysr1_boxplot['Interval']
 +plt.boxplot(data_gysr1)
 +ax = plt.gca()
 +ax.set_xlabel('222 cases')
 +ax.set_ylabel('Intereruption time (minutes)')
 +ax.set_title('Box and Whisker Plot')
 +plt.show()
 +</code>
 + 
 + 
 +\\ 
 + 
 +===== ScatterPlot (p. 7) =====
 +
 +AB: Set both facecolor and edgecolor to change the whole marker. You can also use two different colors for the fill and the outline of each dot.
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# '...' is left as in the original; replace it with the path to the data folder
 +geysr1_scatterplot = pd.read_csv(r'...\geyser1.TAB', sep='\t')
 +geysr1_data_Xax = geysr1_scatterplot['Duration']
 +geysr1_data_Yax = geysr1_scatterplot['Interval']
 +plt.scatter(geysr1_data_Xax, geysr1_data_Yax, facecolor='y', edgecolor='y')
 +ax = plt.gca()
 +ax.set_xlabel('Eruption duration time (minutes)')
 +ax.set_ylabel('Intereruption time (minutes)')
 +ax.set_title('Scatter Plot of INTERVAL vs DURATION')
 +plt.show()
 +</code>
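As a small illustration of the note on marker colors, the sketch below gives each dot a yellow fill and a black outline. The points are made-up values, not the geyser data, and the figure is only created, not displayed.

```python
import matplotlib.pyplot as plt

# Made-up points for illustration only; not the geyser data
x = [1.8, 2.0, 3.3, 4.1, 4.5]
y = [54, 51, 74, 82, 88]

fig, ax = plt.subplots()
# Yellow fill with a black outline for every marker
points = ax.scatter(x, y, facecolor='y', edgecolor='k')
ax.set_xlabel('x (made-up values)')
ax.set_ylabel('y (made-up values)')
```

Call `plt.show()` afterwards to display the figure.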
 + 
 + 
 +\\ 
 + 
 + 
 +===== Descriptive statistics (p.9) =====
 +
 +Note: try different selections, e.g. the whole population, only the rows where 'Duration' <= 3, or the whole dataframe.
 +
 +[[http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics|doc]] – [[http://www.marsja.se/pandas-python-descriptive-statistics/|example]]
 +
 +<code python>
 +import pandas as pd
 +gysr1 = pd.read_csv('../geyser1.tab', sep='\t')
 +# Summary statistics of 'Duration' for the rows where 'Duration' <= 3
 +print(gysr1.loc[gysr1['Duration'] <= 3, 'Duration'].describe())
 +</code>
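The different selections suggested in the note can be compared side by side. This is a minimal sketch on a made-up table (the values are invented and only the column names follow the geyser file), so the printed numbers mean nothing beyond the illustration.

```python
import pandas as pd

# Made-up values for illustration only; not the real geyser data
toy = pd.DataFrame({'Duration': [1.8, 2.0, 4.3, 4.5, 3.0, 4.0],
                    'Interval': [54, 51, 82, 88, 60, 79]})

# Whole dataframe: one summary column per numeric variable
whole = toy.describe()

# One variable, restricted to the rows where 'Duration' <= 3
short = toy.loc[toy['Duration'] <= 3, 'Interval'].describe()

print(whole)
print(short)
```

`describe()` returns a DataFrame in the first case and a Series in the second, so a single statistic can be read with, for example, `short['mean']`.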
 + 
 + 
 +\\ 
 + 
 + 
 +===== Boxplot (p.9) =====
 +
 +Selecting rows in a dataframe: [[http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking|doc]] / [[http://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas|example]]
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +gysr1 = pd.read_csv('../geyser1.tab', sep='\t')
 +# Split the rows into 'Duration' <= 3 and 'Duration' > 3
 +gysr1_inf3 = gysr1.loc[gysr1['Duration'] <= 3]
 +gysr1_sup3 = gysr1.loc[gysr1['Duration'] > 3]
 +plt.boxplot([gysr1_inf3['Interval'], gysr1_sup3['Interval']], labels=['inf3', 'sup3'])
 +plt.show()
 +</code>
 + 
 + 
 +\\ 
 + 
 + 
 +====== International adoption rates (p.13) ======
 +
 +===== Boxplot (p.14) =====
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# Raw string so the backslashes in the Windows path are not read as escapes
 +adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt1 = adopt_data['Visa91']
 +plt.boxplot(adopt1)
 +ax = plt.gca()
 +ax.set_title('Box and Whisker Plot')
 +ax.set_xlabel('39 cases')
 +ax.set_ylabel('Number of visas in 1991')
 +plt.show()
 +</code>
 + 
 + 
 +\\ 
 + 
 + 
 +===== Histogram (p.14) =====
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# Raw string so the backslashes in the Windows path are not read as escapes
 +adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt1 = adopt_data['Visa91']
 +plt.hist(adopt1)
 +plt.show()
 +</code>
 + 
 + 
 +\\ 
 + 
 +===== Histogram with Log (p.18) =====
 +Note: the original attempt did not work; the version below plots the histogram of the log-transformed values instead.
 +<code python>
 +import numpy as np
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +adopt = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +# Base-10 log of the visa counts (assumes every value is > 0; the base is a choice, not taken from the book)
 +adopt_loghist = np.log10(adopt['Visa91'])
 +ax = plt.gca()
 +# log=True would instead put the frequencies (y axis) on a log scale
 +ax.hist(adopt_loghist, bins=10, color='r')
 +ax.set_xlabel('Log (Number of 1991 visas)')
 +ax.set_ylabel('Frequency')
 +ax.set_title('Histogram')
 +plt.show()
 +</code>
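One possible way to get a histogram on a logarithmic x axis is to keep the raw counts and use bin edges that are equally spaced in log10, then switch the axis to a log scale. The sketch below uses made-up, right-skewed numbers rather than the real Visa91 column, and the range of the edges (10 to about 3162) is an assumption chosen to cover those numbers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up, strongly right-skewed counts; not the real Visa91 column
visas = np.array([12, 25, 40, 60, 95, 150, 300, 700, 1500, 2600])

# 11 edges (10 bins) equally spaced in log10, from 10**1 to 10**3.5
edges = np.logspace(1, 3.5, 11)
counts, _ = np.histogram(visas, bins=edges)

fig, ax = plt.subplots()
ax.hist(visas, bins=edges, color='r')
ax.set_xscale('log')  # log-spaced bins are drawn with equal width
ax.set_xlabel('Number of 1991 visas (log scale)')
ax.set_ylabel('Frequency')
ax.set_title('Histogram')
```

Call `plt.show()` afterwards to display the figure. Unlike `log=True`, this leaves the frequencies on a linear scale.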
 + 
 + 
 +===== Scatterplot (p. 17) =====
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# '...' is left as in the original; the raw string stops '\a' being read as an escape
 +adoption_scatterplot = pd.read_csv(r'...\adopt.TAB', sep='\t')
 +adopt_data_Xax = adoption_scatterplot['Visa88']
 +adopt_data_Yax = adoption_scatterplot['Visa91']
 +plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
 +ax = plt.gca()
 +ax.set_xlabel('Number of Visas in 1988')
 +ax.set_ylim([0, 2700])
 +ax.set_xlim([0, 5000])
 +ax.set_ylabel('Number of Visas in 1991')
 +ax.set_title('ScatterPlot of Visa91 vs Visa88')
 +plt.show()
 +</code>
 + 
 + 
 +\\ 
 + 
 + 
 +===== Scatterplot (p.18) =====
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# Raw string so the backslashes in the Windows path are not read as escapes
 +adoption_scatterplot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt_data_Xax = adoption_scatterplot['Visa91']
 +adopt_data_Yax = adoption_scatterplot['Visa92']
 +plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
 +ax = plt.gca()
 +ax.set_xlabel('Number of Visas in 1991')
 +ax.set_ylim([0, 1800])
 +ax.set_xlim([0, 2700])
 +ax.set_ylabel('Number of Visas in 1992')
 +ax.set_title('ScatterPlot of Visa92 vs Visa91')
 +plt.show()
 +</code>
 + 
 +\\ 
 + 
 + 
 +====== The Performance of stock mutual funds (p. 21) ====== 
 + 
 + 
 + 
 + 
 + 
 + 
 +\\ 
 + 
 +====== Predicting the sales and airplay of popular music (p. 23) ======
 + 
 + 
 + 
 + 
 +\\ 
 + 
 +====== Another look at the "Old Faithful" geyser and adoption visas (p.24) ======
 +
 +After modifying the bins of both histograms:
 +The histogram is reliable for the "Old Faithful" geyser but not for the adoption rates, where the appearance of the histogram changes quite a lot when the bins are changed.
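The sensitivity to the number of bins can also be checked numerically, without drawing anything. The sketch below counts the same made-up, right-skewed sample with 5, 10 and 20 bins. The values are invented for illustration and are not the adoption data.

```python
import numpy as np

# Made-up skewed sample standing in for a column such as Visa91
values = np.array([5, 8, 12, 20, 33, 47, 60, 110, 250, 400, 900, 2600])

# Bin counts for several choices of the number of equal-width bins
tables = {bins: np.histogram(values, bins=bins)[0] for bins in (5, 10, 20)}

for bins, counts in tables.items():
    print(bins, counts.tolist())
```

With skewed data most observations fall into the first bin or two, and how they are split depends strongly on the bin width, which is why the picture changes so much.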
 + 
 +\\ 
 + 
 +====== Productivity versus quality in the assembly plant (p. 25) ======
 +
 +
 +===== Scatterplot of Productivity vs Quality (p. 26) =====
 +<code python>
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +# Raw string so the backslashes in the Windows path are not read as escapes
 +scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +productivity_Y = scatter_plot['Producti']
 +quality_X = scatter_plot['Quality']
 +# x = quality, y = productivity, matching the axis labels below
 +plt.scatter(quality_X, productivity_Y, color='r')
 +ax = plt.gca()
 +ax.set_xlabel('Assembly defects per 100 cars')
 +ax.set_ylabel('Hours per vehicle')
 +ax.set_title('Scatter Plot of Productivity vs Quality')
 +plt.show()
 +</code>
 + 
 +\\ 
 + 
 +===== Scatter Plot of PRODJAPN vs QUALJAPN (p. 27) =====
  
python/first_course_statistics.txt · Last modified: 2017/09/26 08:54 by Francesco Beretta