Back to the [[python:accueil|home page]]


==== HTML and related technologies ====

  * [[https://www.w3schools.com/html/default.asp|HTML tutorial]] (in English)
  * [[http://www.startyourdev.com/html/tag-html-index|Reference for HTML, CSS, XSL, etc.]] (in French)


====  Fetching HTML pages and converting them to XML  ====


  * LXML
    * [[http://adrien.barbaresi.eu/blog/parsing-converting-lxml-html-tei.html|Parsing and converting HTML documents to XML/TEI format using Python’s lxml]]
    * [[https://pythontips.com/2018/06/20/an-intro-to-web-scraping-with-lxml-and-python/|Tutorial with an example]]
      * [[https://www.youtube.com/watch?v=5N066ISH8og|Video of the same tutorial]]
  * BeautifulSoup
    * [[https://programminghistorian.org/en/lessons/intro-to-beautiful-soup|Programming historian: Intro to Beautiful Soup]]
    * [[https://digitallatin.org/blog/using-beautifulsoup-add-works-dlls-database|Using BeautifulSoup to add works to the DLL's database]]
  * Trafilatura
    * A new library still under development: useful and ready to use out of the box, though sometimes limited in the options it offers (depending on the complexity of the HTML page)
    * [[https://github.com/adbar/trafilatura|Trafilatura on GitHub]]
    * [[http://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html|Extracting the main text content from web pages using Python]] 
  * [[https://scrapy.org/|Scrapy]]
    * YouTube: [[https://www.youtube.com/watch?v=ve_0h4Y8nuI&list=PLhTjy8cBISEqkN-5Ku_kXG4QW33sxQo0t|Complete tutorial]]
    * [[https://fr.wikipedia.org/wiki/Scrapy|Overview on Wikipedia]] (in French)
    * [[https://docs.scrapy.org/en/latest/intro/overview.html|Scrapy at a glance]]
    * [[https://docs.scrapy.org/en/latest/|Documentation]]
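
To give a rough idea of what "fetching an HTML page and converting it to XML" involves, here is a minimal sketch that uses only the Python standard library (''urllib.request'', ''html.parser'' and ''xml.etree.ElementTree''). It is not how lxml, BeautifulSoup, Trafilatura or Scrapy work internally, and those libraries handle malformed HTML far more robustly. The names ''fetch_html'', ''HTMLToXML'' and ''html_to_xml'', as well as the ''document'' wrapper element, are assumptions made up for this illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def fetch_html(url):
    """Download a page and decode it (falls back to UTF-8 if no charset is declared)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HTMLToXML(HTMLParser):
    """Build an ElementTree from the parser's start-tag, end-tag and text events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # hypothetical wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        attrib = {name: (value if value is not None else "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tag == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text that follows a child element is stored in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_xml(html_text):
    """Return the HTML fragment re-serialized as an XML string."""
    parser = HTMLToXML()
    parser.feed(html_text)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    sample = "<p>Hello<br>world <a href='x.html'>link</a></p>"
    print(html_to_xml(sample))
```

To process a real page, pass the result of ''fetch_html(url)'' to ''html_to_xml''. Comments and the doctype are dropped in this sketch, and attribute names that are not valid XML names are not sanitized, which is one reason to prefer the libraries listed above for real work.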


{{:python:war_entities_stag.csv.zip|}}