python:first_course_statistics

Differences

Below are the differences between two revisions of the page.

python:first_course_statistics [2016/10/17 07:41]
Francesco Beretta [ScatterPlot (p. 7)]
python:first_course_statistics [2017/09/26 08:54] (current version)
Francesco Beretta [General instructions]
Line 5: Line 5:
   * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]]
   * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]]
 +
 +
 +Get the data from [[http://people.stern.nyu.edu/jsimonof/Casebook/Data/ASCII/README.html|this site]].
 +
  
 Save your scripts in a folder inside the data folder, naming the script folder 'my_scripts' or whatever you like. If 'my_scripts' is set as your [[python:generic_features#get_the_current_working_directory_address|current working directory]], then the data files are available at the address '../[data file]', for instance: '../geyser1.TAB'
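
The folder layout described above can be sketched as follows. This is only an illustration: the helper name `data_file` and the folder `/data/my_scripts` are hypothetical, and the sketch assumes the data files sit exactly one level above the scripts folder.

```python
from pathlib import Path


def data_file(name, scripts_dir=None):
    """Return the path of a data file stored one level above the scripts folder.

    scripts_dir is a hypothetical parameter; when omitted, the current
    working directory is taken to be the scripts folder.
    """
    base = Path(scripts_dir) if scripts_dir is not None else Path.cwd()
    # '..' relative to the scripts folder is simply its parent folder
    return base.parent / name


# Hypothetical layout: /data/geyser1.TAB and /data/my_scripts/<your scripts>
print(data_file("geyser1.TAB", scripts_dir="/data/my_scripts"))
```

The resulting path can be passed directly to `pd.read_csv`, which accepts `pathlib.Path` objects as well as strings.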
Line 35: Line 39:
 import matplotlib.pyplot as plt
 import pandas as pd
-gysr1_boxplot = pd.read_csv('...\geyser1.TAB', '\t')
+gysr1_boxplot = pd.read_csv('.../geyser1.TAB', '\t')
 data_gysr1 = gysr1_boxplot['Interval']
 plt.boxplot(data_gysr1)
Line 55: Line 59:
 import matplotlib.pyplot as plt
 import pandas as pd
-geysr1_scatterplot = pd.read_csv('...\geyser1.TAB', '\t')
+geysr1_scatterplot = pd.read_csv('.../geyser1.TAB', '\t')
 geysr1_data_Xax = geysr1_scatterplot['Duration']
 geysr1_data_Yax = geysr1_scatterplot['Interval']
Line 65: Line 69:
 plt.show()
 </code>
 +
 +
 +\\
 +
 +
 +===== Descriptive statistics (p.9) =====
 +
 +Note: try different examples, e.g. the whole population, only the rows where 'Duration' <= 3, or the whole dataframe instead of a single column.
 +
 +[[http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics|doc]] – [[http://www.marsja.se/pandas-python-descriptive-statistics/|example]]
 +
 +<code python>
 +import pandas as pd
 +gysr1 = pd.read_csv('../geyser1.TAB', sep='\t')
 +# summary statistics of 'Duration' for the rows where Duration <= 3
 +gysr1.loc[gysr1['Duration'] <= 3, 'Duration'].describe()
 +</​code>​
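
The variants suggested in the note (whole population, a filtered subset, a single column) can be compared side by side. The sketch below uses a small made-up table rather than the geyser file, so the numbers are purely illustrative. Only the column names 'Duration' and 'Interval' come from the page.

```python
import pandas as pd

# Made-up values for illustration only; they are NOT the geyser1.TAB data.
toy = pd.DataFrame({
    "Duration": [1.8, 2.0, 3.0, 4.1, 4.5, 4.7],
    "Interval": [54, 57, 62, 80, 85, 88],
})

# Whole dataframe: one summary column per numeric variable, all rows.
whole = toy.describe()

# Only the rows where Duration <= 3, still for every column.
short = toy[toy["Duration"] <= 3].describe()

# One column of the filtered rows.
one_col = toy.loc[toy["Duration"] <= 3, "Interval"].describe()

print(whole, short, one_col, sep="\n\n")
```

Comparing the `count` row of each summary is a quick way to check that the filter selected the intended rows.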
 +
 +
 +\\
 +
  
 ===== Boxplot (p.9) =====
Line 78: Line 102:
 plt.boxplot([gysr1_inf3['Interval'], gysr1_sup3['Interval']], labels=['inf3', 'sup3'])
 </code>
 +
 +
 +\\
 +
  
 ====== International adoption rates (p.13) ======
  
 +===== Boxplot (p.14) =====
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# raw string so that the backslashes of the Windows path are not read as escape sequences
 +adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt1 = adopt_data['Visa91']
 +plt.boxplot(adopt1)
 +ax = plt.gca()
 +ax.set_title('Box and Whisker Plot')
 +ax.set_xlabel('39 cases')
 +ax.set_ylabel('Number of visas in 1991')
 +plt.show()
 +</​code>​
 +
 +
 +\\
 +
 +
 +===== Histogram (p.14) =====
 +
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# raw string so that the backslashes of the Windows path are not read as escape sequences
 +adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt1 = adopt_data['Visa91']
 +plt.hist(adopt1)
 +plt.show()
 +</​code>​
 +
 +
 +\\
 +
 +===== Histogram with Log (p.18) =====
 +I have not found a way to do this with a single hist() option. The version below plots the histogram of the log10 values instead.
 +<code python>
 +import numpy as np
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +adopt = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +# take the log10 of the counts first (assumes every 'Visa91' value is > 0)
 +adopt_loghist = np.log10(adopt['Visa91'])
 +#adopt_loghist.semilogx() --> was one of the possibilities
 +ax = plt.gca()
 +ax.hist(adopt_loghist, bins=10, color='r') # log=True would instead give the log of the frequencies
 +ax.set_xlabel('Log (Number of 1991 visas)')
 +ax.set_ylabel('Frequency')
 +ax.set_title('Histogram')
 +plt.show()
 +</code>
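
Two common ways of getting a histogram on a log scale are sketched below, as an illustration rather than the book's method: binning the log10 of the values with equal-width bins, or binning the raw values with log-spaced edges. The sample is made up (it is not the adopt.TAB data), and the bin edges at powers of ten are an arbitrary choice for the example.

```python
import numpy as np

# Made-up, right-skewed counts for illustration only (not the adopt.TAB data).
visas = np.array([3, 8, 12, 25, 40, 90, 150, 400, 900, 2500])

# Option 1: equal-width bins on the log10 scale (edges at 0, 1, 2, 3, 4).
log_counts, log_edges = np.histogram(np.log10(visas), bins=[0, 1, 2, 3, 4])

# Option 2: log-spaced edges on the original scale (1, 10, 100, 1000, 10000).
edges = np.logspace(0, 4, num=5)
counts, _ = np.histogram(visas, bins=edges)

# Both options put the same observations in the same bins.
print(log_counts.tolist(), counts.tolist())
```

For plotting, option 1 corresponds to `plt.hist(np.log10(values))`, while option 2 corresponds to `plt.hist(values, bins=edges)` followed by `plt.gca().set_xscale('log')`. Passing `log=True` to `hist` only changes the frequency axis.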
 +
 +
 +===== Scatterplot (p. 17) =====
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# forward slash: in a normal string '\a' would be read as an escape sequence
 +adoption_scatterplot = pd.read_csv('.../adopt.TAB', sep='\t')
 +adopt_data_Xax = adoption_scatterplot['Visa88']
 +adopt_data_Yax = adoption_scatterplot['Visa91']
 +plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
 +ax = plt.gca()
 +ax.set_xlabel('Number of Visas in 1988')
 +ax.set_ylim([0, 2700])
 +ax.set_xlim([0, 5000])
 +ax.set_ylabel('Number of Visas in 1991')
 +ax.set_title('ScatterPlot of Visa91 vs Visa88')
 +plt.show()
 +</​code>​
 +
 +
 +\\
 +
 +
 +===== Scatterplot (p.18) =====
 +<code python>
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +# raw string so that the backslashes of the Windows path are not read as escape sequences
 +adoption_scatterplot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
 +adopt_data_Xax = adoption_scatterplot['Visa91']
 +adopt_data_Yax = adoption_scatterplot['Visa92']
 +plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
 +ax = plt.gca()
 +ax.set_xlabel('Number of Visas in 1991')
 +ax.set_ylim([0, 1800])
 +ax.set_xlim([0, 2700])
 +ax.set_ylabel('Number of Visas in 1992')
 +ax.set_title('ScatterPlot of Visa92 vs Visa91')
 +plt.show()
 +</​code>​
 +
 +\\
 +
 +
 +====== The Performance of stock mutual funds (p. 21) ======
 +
 +
 +
 +
 +
 +
 +\\
 +
 +====== Predicting the sales and airplay of popular music (p. 23) ======
 +
 +
 +
 +
 +\\
 +
 +====== Another look at the "Old Faithful" geyser and adoption visas (p.24) ======
 +
 +I modified the bins of both histograms:
 +The histogram is reliable for the "Old Faithful" geyser but not for the adoption rates: the appearance of the latter changes quite a lot when the bins are changed.
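
How strongly a histogram depends on the number of bins can be checked numerically before plotting. The sketch below is an illustration on a made-up skewed sample, not on the geyser or adoption data. The helper name `bin_counts` is hypothetical.

```python
import numpy as np

# Made-up, strongly skewed sample for illustration only.
sample = np.array([1, 2, 2, 3, 3, 3, 4, 9, 15, 40])


def bin_counts(values, n_bins):
    """Return the histogram counts for n_bins equal-width bins."""
    counts, _edges = np.histogram(values, bins=n_bins)
    return counts.tolist()


# The same data summarised with coarser and finer bins.
for n_bins in (2, 4, 8):
    print(n_bins, bin_counts(sample, n_bins))
```

If the overall shape of the counts stays similar across several bin choices, the histogram can be read with more confidence. If it changes a lot, as reported above for the adoption data, a transformation or a different display may be worth trying.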
 +
 +\\
 +
 +====== Productivity versus quality in the assembly plant (p. 25) ======
 +
 +
 +===== Scatterplot of Productivity vs Quality (p. 26) =====
 +<code python>
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +productivity_Y = scatter_plot['Producti']
 +quality_X = scatter_plot['Quality']
 +# x values first, then y values; scatter() has no 'bins' or 'colors' argument
 +plt.scatter(quality_X, productivity_Y, color='r')
 +ax = plt.gca()
 +ax.set_xlabel('Assembly defects per 100 cars')
 +ax.set_ylabel('Hours per vehicle')
 +ax.set_title('Scatter Plot of Productivity VS Quality')
 +plt.show()
 +</​code>​
 +
 +\\
 +
 +===== Scatter Plot of PRODJAPN vs QUALJAPN (p. 27) =====
 +
 +<code python>
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +productivity_Y = scatter_plot['ProdJapn']
 +quality_X = scatter_plot['QualJapn']
 +# x values first, then y values; scatter() has no 'bins' or 'colors' argument
 +plt.scatter(quality_X, productivity_Y, color='r')
 +ax = plt.gca()
 +ax.set_xlabel('Assembly defects per 100 cars (Japanese origin)')
 +ax.set_ylabel('Hours per vehicle (Japanese origin)')
 +ax.set_title('Scatter Plot of PRODJAPN VS QUALJAPN')
 +plt.show()
 +</​code>​
 +
 +
 +===== Scatter Plot of PRODNONJ vs QUALNONJ (p. 27) =====
 +<code python>
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +productivity_Y = scatter_plot['ProdNonJ']
 +quality_X = scatter_plot['QualNonJ']
 +# x values first, then y values; scatter() has no 'bins' or 'colors' argument
 +plt.scatter(quality_X, productivity_Y, color='r')
 +ax = plt.gca()
 +ax.set_xlabel('Assembly defects per 100 cars (non-Japanese origin)')
 +ax.set_ylabel('Hours per vehicle (non-Japanese origin)')
 +ax.set_title('Scatter Plot of PRODNONJ VS QUALNONJ')
 +plt.show()
 +</​code>​
 +
 +
 +
 +===== Scatterplot of productivity VS quality (p. 28) =====
 +<code python>
 +import pandas as pd
 +import matplotlib.pyplot as plt
 +scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +productivity_Y = scatter_plot['Producti']
 +quality_X = scatter_plot['Quality']
 +# x values first, then y values; scatter() has no 'bins' or 'colors' argument
 +plt.scatter(quality_X, productivity_Y, color='r')
 +ax = plt.gca()
 +ax.set_xlabel('Assembly defects per 100 cars')
 +ax.set_ylabel('Hours per vehicle')
 +ax.set_title('Scatter Plot of PRODUCTIVITY VS QUALITY')
 +plt.show()
 +</​code>​
 +
 +
 +===== Productivity versus quality in the assembly plant (p.29) =====
 +
 +It worked the first time but it no longer works. Maybe a Windows error again?
 +
 +<code python>
 +#1
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +# .loc needs a boolean mask here, not the column values themselves
 +# assumption: 'QualNonJ' and 'QualJapn' are filled only for plants of that group (NaN otherwise)
 +non_japanese = data_comparison.loc[data_comparison['QualNonJ'].notna()]
 +japanese = data_comparison.loc[data_comparison['QualJapn'].notna()]
 +plt.boxplot([non_japanese['Quality'], japanese['Quality']], labels=['Non-japanese', 'Japanese'])
 +plt.show()
 +
 +#2
 +import matplotlib.pyplot as plt
 +import pandas as pd
 +data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
 +# same assumption for 'ProdNonJ' and 'ProdJapn'
 +non_japanese = data_comparison.loc[data_comparison['ProdNonJ'].notna()]
 +japanese = data_comparison.loc[data_comparison['ProdJapn'].notna()]
 +plt.boxplot([non_japanese['Producti'], japanese['Producti']], labels=['Non-japanese', 'Japanese'])
 +plt.show()
 +</​code>​
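
One possible cause of the failure, independent of Windows, is that `.loc[...]` given a column of numbers looks those numbers up as row labels instead of filtering rows. The sketch below shows the boolean-mask alternative on a made-up table. The layout (group-specific columns left empty for plants outside the group) is an assumption about prdq.TAB, not something confirmed by the page.

```python
import numpy as np
import pandas as pd

# Made-up table for illustration only; the NaN layout is an assumption.
toy = pd.DataFrame({
    "Quality":  [30.0, 45.0, 60.0, 75.0],
    "QualJapn": [30.0, 45.0, np.nan, np.nan],
    "QualNonJ": [np.nan, np.nan, 60.0, 75.0],
})

# toy.loc[toy["QualJapn"]] would treat 30.0, 45.0, ... as row labels,
# which typically fails with a KeyError because no such labels exist.

# A boolean mask selects the rows where the group column is filled in.
japanese = toy.loc[toy["QualJapn"].notna(), "Quality"]
non_japanese = toy.loc[toy["QualNonJ"].notna(), "Quality"]

print(japanese.tolist(), non_japanese.tolist())
```

The two resulting series can then be passed as a list to `plt.boxplot` to draw the groups side by side.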
python/first_course_statistics.1476682908.txt.gz · Last modified: 2016/10/17 07:41 by Francesco Beretta