Differences

Below are the differences between two revisions of the page.

--- python:first_course_statistics [2016/10/17 07:41]
Francesco Beretta [ScatterPlot (p. 7)]
+++ python:first_course_statistics [2017/09/26 08:54] (Current version)
Francesco Beretta [General instructions]
@@ Line 5: / Line 5: @@
   * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]]
   * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]]
+Get the data from [[http://people.stern.nyu.edu/jsimonof/Casebook/Data/ASCII/README.html|this site]].
 Save your scripts in a folder inside the data folder, calling the script folder 'my_scripts' or whatever you like. If 'my_scripts' is set as your [[python:generic_features#get_the_current_working_directory_address|current working directory]], then the data files are available at the address '../[data file]', for instance: '../geyser1.TAB'
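The relative address described above can also be built in code. The following is a minimal sketch, not part of the original page: the helper name `data_path` and the folder name `data/my_scripts` are assumptions used only for illustration.

```python
from pathlib import Path

def data_path(filename, script_dir=None):
    """Return the path of a data file stored one level above the script folder.

    script_dir defaults to the current working directory, which the page
    assumes is the script folder (e.g. 'my_scripts').
    """
    base = Path(script_dir) if script_dir is not None else Path.cwd()
    return base.parent / filename

# Hypothetical layout: data/geyser1.TAB and data/my_scripts/
print(data_path("geyser1.TAB", script_dir="data/my_scripts"))
```

The returned path can be passed directly to `pd.read_csv`, which accepts `Path` objects.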
@@ Line 35: / Line 39: @@
 import matplotlib.pyplot as plt
 import pandas as pd
-gysr1_boxplot = pd.read_csv('...\geyser1.TAB', '\t')
+gysr1_boxplot = pd.read_csv('.../geyser1.TAB', '\t')
 data_gysr1 = gysr1_boxplot['Interval']
 plt.boxplot(data_gysr1)
@@ Line 55: / Line 59: @@
 import matplotlib.pyplot as plt
 import pandas as pd
-geysr1_scatterplot = pd.read_csv('...\geyser1.TAB', '\t')
+geysr1_scatterplot = pd.read_csv('.../geyser1.TAB', '\t')
 geysr1_data_Xax = geysr1_scatterplot['Duration']
 geysr1_data_Yax = geysr1_scatterplot['Interval']
@@ Line 65: / Line 69: @@
 plt.show()
 </code>
+\\
+===== Descriptive statistics (p.9) =====
+Note: try different subsets, e.g. the whole population (the full dataframe) or only the rows where 'Duration' <= 3.
+[[http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics|doc]] – [[http://www.marsja.se/pandas-python-descriptive-statistics/|example]]
+<code python>
+import pandas as pd
+gysr1 = pd.read_csv('../geyser1.tab', sep='\t')
+gysr1['Duration'][gysr1['Duration'] <= 3].describe()
+</code>
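To compare the subsets suggested in the note side by side, the summaries can be collected into one table. This is only a sketch: the numbers below are made up and stand in for geyser1.TAB; only the column names 'Duration' and 'Interval' come from the page.

```python
import pandas as pd

# Made-up values standing in for geyser1.TAB (not the real data).
gysr1 = pd.DataFrame({
    "Duration": [1.8, 2.0, 3.0, 4.2, 4.5, 4.7],
    "Interval": [54, 57, 62, 80, 84, 88],
})

# describe() on the whole column and on two subsets chosen with a boolean condition
whole = gysr1["Interval"].describe()
short = gysr1.loc[gysr1["Duration"] <= 3, "Interval"].describe()
long_ = gysr1.loc[gysr1["Duration"] > 3, "Interval"].describe()

# One column per subset, one row per statistic
summary = pd.DataFrame({"all": whole, "Duration<=3": short, "Duration>3": long_})
print(summary)
```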
+\\
 ===== Boxplot (p.9) =====
@@ Line 78: / Line 102: @@
 plt.boxplot([gysr1_inf3['Interval'],gysr1_sup3['Interval']], labels= ['inf3','sup3'])
 </code>
+\\
 ====== International adoption rates (p.13) ======
+===== Boxplot (p.14) =====
+<code python>
+import matplotlib.pyplot as plt
+import pandas as pd
+adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
+adopt1 = adopt_data['Visa91']
+plt.boxplot(adopt1)
+ax = plt.gca()
+ax.set_title('Box and Whisker Plot')
+ax.set_xlabel('39 cases')
+ax.set_ylabel('Number of visas in 1991')
+plt.show()
+</code>
+\\
+===== Histogram (p.14) =====
+<code python>
+import matplotlib.pyplot as plt
+import pandas as pd
+adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
+adopt1 = adopt_data['Visa91']
+plt.hist(adopt1)
+plt.show()
+</code>
+\\
+=====Histogram with Log (p.18)=====
+I did not find a direct way to do it. One option, assumed here, is to take the base-10 logarithm of the positive values first and then draw an ordinary histogram of the logged values.
+<code Python>
+import numpy as np
+import pandas as pd
+import matplotlib.pyplot as plt
+adopt = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
+visa91 = adopt['Visa91']
+# log-transform the data itself; log=True in hist() would log-scale the frequencies instead
+adopt_loghist = np.log10(visa91[visa91 > 0])
+ax = plt.gca()
+ax.hist(adopt_loghist, bins=10, color='r')
+ax.set_xlabel('Log (Number of 1991 visas)')
+ax.set_ylabel('Frequency')
+ax.set_title('Histogram')
+plt.show()
+</code>
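The histogram above is meant to be drawn on the logarithm of the visa counts, which is not the same as log-scaling the frequencies. The sketch below shows the effect of binning logged values, using NumPy only and made-up numbers (not the adopt.TAB data); base-10 logarithms are an assumption.

```python
import numpy as np

# Made-up, strongly right-skewed counts (not the real Visa91 column).
visas = np.array([3, 8, 12, 40, 95, 150, 400, 900, 2500])

# Equal-width bins on the raw values: almost everything falls in the first bin.
counts_raw, edges_raw = np.histogram(visas, bins=3)

# Equal-width bins on log10 of the values: the observations spread out.
counts_log, edges_log = np.histogram(np.log10(visas), bins=3)

print("raw counts:", counts_raw)
print("log counts:", counts_log)
```

Passing `np.log10(values)` to `hist()` changes what is being binned, whereas `log=True` keeps the raw bins and only rescales the frequency axis.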
+=====Scatterplot (p. 17)=====
+<code python>
+import matplotlib.pyplot as plt
+import pandas as pd
+adoption_scatterplot = pd.read_csv('.../adopt.TAB', sep='\t')
+adopt_data_Xax = adoption_scatterplot['Visa88']
+adopt_data_Yax = adoption_scatterplot['Visa91']
+plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
+ax = plt.gca()
+ax.set_xlabel('Number of Visas in 1988')
+ax.set_ylim([0,2700])
+ax.set_xlim([0,5000])
+ax.set_ylabel('Number of Visas in 1991')
+ax.set_title('ScatterPlot of Visa91 vs Visa88')
+plt.show()
+</code>
+\\
+=====Scatterplot (p.18)=====
+<code python>
+import matplotlib.pyplot as plt
+import pandas as pd
+adoption_scatterplot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t')
+adopt_data_Xax = adoption_scatterplot['Visa91']
+adopt_data_Yax = adoption_scatterplot['Visa92']
+plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y')
+ax = plt.gca()
+ax.set_xlabel('Number of Visas in 1991')
+ax.set_ylim([0,1800])
+ax.set_xlim([0,2700])
+ax.set_ylabel('Number of Visas in 1992')
+ax.set_title('ScatterPlot of Visa92 vs Visa91')
+plt.show()
+</code>
+\\
+====== The Performance of stock mutual funds (p. 21) ======
+\\
+====== Predicting the sales and airplay of popular music (p. 23)======
+\\
+====== Another look at the "Old faithful" geyser and adoption visas (p.24) ======
+Changing the number of bins in both histograms gives the following result:
+The histogram is reliable for the "Old Faithful" geyser but not for the adoption rates, whose histogram changes considerably in appearance when the bins change.
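One way to check how sensitive a histogram is to the choice of bins is to compute the counts for several bin numbers and compare them. This is a rough sketch on a made-up bimodal sample (not the geyser or adoption data); the helper name `histograms_for` and the bin counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up bimodal sample, loosely imitating two clusters of waiting times.
sample = np.concatenate([rng.normal(55, 5, 100), rng.normal(80, 5, 150)])

def histograms_for(data, bin_counts=(5, 10, 20)):
    """Return {number of bins: array of counts} for each requested bin count."""
    return {b: np.histogram(data, bins=b)[0] for b in bin_counts}

hists = histograms_for(sample)
for b, counts in hists.items():
    print(b, counts)
```

If the overall shape (for example, two separate peaks) survives across the different bin counts, the histogram can be considered stable in the sense used above.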
+\\
+====== Productivity versus quality in the assembly plant (p. 25)======
+===== Scatterplot of Productivity vs Quality (p. 26) =====
+<code Python>
+import pandas as pd
+import matplotlib.pyplot as plt
+scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
+productivity_Y = scatter_plot['Producti']
+quality_X = scatter_plot['Quality']
+# scatter() takes x first, then y; it has no 'bins' argument and the color keyword is 'color'
+plt.scatter(quality_X, productivity_Y, color='r')
+ax = plt.gca()
+ax.set_xlabel('Assembly defects per 100 cars')
+ax.set_ylabel('Hours per vehicle')
+ax.set_title('Scatter Plot of Productivity VS Quality')
+plt.show()
+</code>
+\\
+=====Scatter Plot of PRODJAPN vs QUALJAPN (p. 27) =====
+<code Python>
+import pandas as pd
+import matplotlib.pyplot as plt
+scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
+productivity_Y = scatter_plot['ProdJapn']
+quality_X = scatter_plot['QualJapn']
+# scatter() takes x first, then y; it has no 'bins' argument and the color keyword is 'color'
+plt.scatter(quality_X, productivity_Y, color='r')
+ax = plt.gca()
+ax.set_xlabel('Assembly defects per 100 cars (Japanese origin)')
+ax.set_ylabel('Hours per vehicle (Japanese origin)')
+ax.set_title('Scatter Plot of PRODJAPN VS QUALJAPN')
+plt.show()
+</code>
+=====Scatter Plot of PRODNONJ vs QUALNONJ (p. 27)=====
+<code Python>
+import pandas as pd
+import matplotlib.pyplot as plt
+scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
+productivity_Y = scatter_plot['ProdNonJ']
+quality_X = scatter_plot['QualNonJ']
+# scatter() takes x first, then y; it has no 'bins' argument and the color keyword is 'color'
+plt.scatter(quality_X, productivity_Y, color='r')
+ax = plt.gca()
+ax.set_xlabel('Assembly defects per 100 cars (non-Japanese origin)')
+ax.set_ylabel('Hours per vehicle (non-Japanese origin)')
+ax.set_title('Scatter Plot of PRODNONJ VS QUALNONJ')
+plt.show()
+</code>
+===== Scatterplot of productivity VS quality (p. 28) =====
+<code python>
+import pandas as pd
+import matplotlib.pyplot as plt
+scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
+productivity_Y = scatter_plot['Producti']
+quality_X = scatter_plot['Quality']
+# scatter() takes x first, then y; it has no 'bins' argument and the color keyword is 'color'
+plt.scatter(quality_X, productivity_Y, color='r')
+ax = plt.gca()
+ax.set_xlabel('Assembly defects per 100 cars')
+ax.set_ylabel('Hours per vehicle')
+ax.set_title('Scatter Plot of PRODUCTIVITY VS QUALITY')
+plt.show()
+</code>
+===== Productivity versus quality in the assembly plant (p.29) =====
+The first version worked once and then stopped working. A Windows error is possible, but a likelier cause is that ''.loc[...]'' was given a column of values, which pandas looks up as row labels. The version below plots the per-group columns directly, assuming the Japanese and non-Japanese columns hold each group's values.
+<code python>
+import matplotlib.pyplot as plt
+import pandas as pd
+data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
+#1 quality; dropna() removes the empty cells of the shorter column (assumed to be read as NaN)
+plt.boxplot([data_comparison['QualNonJ'].dropna(), data_comparison['QualJapn'].dropna()], labels=['Non-japanese','Japanese'])
+plt.show()
+#2 productivity, same assumption about the per-group columns
+plt.boxplot([data_comparison['ProdNonJ'].dropna(), data_comparison['ProdJapn'].dropna()], labels=['Non-japanese','Japanese'])
+plt.show()
+</code>
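When splitting a dataframe into groups for a boxplot, it helps to keep in mind that `.loc[...]` given a column of numbers treats those numbers as row labels, while a boolean condition selects rows. The sketch below uses made-up values; the indicator column 'Japanese' is hypothetical and is not claimed to exist in prdq.TAB.

```python
import pandas as pd

# Made-up data; 'Japanese' is a hypothetical 0/1 indicator column.
df = pd.DataFrame({"Quality": [60, 75, 90, 45], "Japanese": [1, 0, 1, 0]})

# Passing the column itself: the values 1, 0, 1, 0 are looked up as row labels.
by_label = df.loc[df["Japanese"]]

# Passing a boolean condition: rows are kept where the condition is True.
japanese = df.loc[df["Japanese"] == 1, "Quality"]
non_japanese = df.loc[df["Japanese"] == 0, "Quality"]

# List of groups in the order expected by plt.boxplot([...], labels=[...])
groups = [non_japanese.tolist(), japanese.tolist()]
print(by_label.index.tolist(), groups)
```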

Wiki de l'ARHN, Axe de recherche en histoire numérique, LARHRA UMR5190