Sistemas y Tecnologías Web: Servidor

Master's in Computer Engineering (II), ULL. First term, 2020/2021


Organization and Classroom: ULL-MII-SYTWS-2021. Professor: Casiano


Lab: Transforming Data and Testing Continuously (p9-t3-transfoming-data)

Extracting Classification Codes

  • When extracting fields from the Project Gutenberg RDF (XML) files, in Traversing the Document, we specifically selected the Library of Congress Subject Headings (LCSH) and stored them in an array called subjects.
  • At that time, we carefully avoided the Library of Congress Classification (LCC) single-letter codes. Recall that the LCC portion of an RDF file looks like this:

data/cache/epub/132/pg132.rdf

<dcterms:subject>
  <rdf:Description rdf:nodeID="Nfb797557d91f44c9b0cb80a0d207eaa5">
    <dcam:memberOf rdf:resource="http://purl.org/dc/terms/LCC"/>
    <rdf:value>U</rdf:value>
  </rdf:Description>
</dcterms:subject>

Using your BDD infrastructure built on Mocha and Chai, implement the following:

  • Add a new assertion to parse-rdf-test.js that checks for book.lcc.
  • It should be of type string and it should be at least one character long.
  • It should start with an uppercase letter of the English alphabet, but not I, O, W, X, or Y.
  1. Run the tests to see that they fail.
  2. Add code to your exported module function in parse-rdf.js to make the tests pass.
Hint
  • Look for an element with an rdf:resource attribute that ends in /LCC
  • Then go to that element's parent
  • Find the text of the first rdf:value descendant

Extracting Sources

Most of the metadata in the Project Gutenberg RDF files describes where each book can be downloaded in various formats.

For example, here’s the part that shows where to download the plain text of The Art of War:

data/cache/epub/132/pg132.rdf

<dcterms:hasFormat>
  <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/132.txt.utf-8">
    <dcterms:isFormatOf rdf:resource="ebooks/132"/>
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">
    2016-09-01T01:20:00.437616</dcterms:modified>
    <dcterms:format>
      <rdf:Description rdf:nodeID="N2293d0caa918475e922a48041b06a3bd">
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
        <rdf:value
        rdf:datatype="http://purl.org/dc/terms/IMT">text/plain</rdf:value>
      </rdf:Description>
    </dcterms:format>
    <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">
    343691</dcterms:extent>
  </pgterms:file>
</dcterms:hasFormat>

        ...

<dcterms:hasFormat>
  <pgterms:file rdf:about="http://www.gutenberg.org/ebooks/132.kindle.noimages">
    <dcterms:isFormatOf rdf:resource="ebooks/132"/>
    <dcterms:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-08-01T01:24:38.440052</dcterms:modified>
    <dcterms:extent rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">598678</dcterms:extent>
    <dcterms:format>
      <rdf:Description rdf:nodeID="N90d807c6b2a042078ac4e05e8e265dd7">
        <rdf:value rdf:datatype="http://purl.org/dc/terms/IMT">application/x-mobipocket-ebook</rdf:value>
        <dcam:memberOf rdf:resource="http://purl.org/dc/terms/IMT"/>
      </rdf:Description>
    </dcterms:format>
  </pgterms:file>
</dcterms:hasFormat>

Suppose we wanted to include a list of download sources in each JSON object we create from an RDF file.

To get an idea of what data you might want, take a look at the Project Gutenberg page for The Art of War.

Consider these questions:

  • Which fields in the raw data would we want to capture, and which could we discard?
  • What structure would make the most sense for this data?
  • What information would you need to be able to produce a table that looked like the one on the Project Gutenberg site?

Once you have an idea of what data you’ll want to extract, try creating a JSON object by hand for this one download source. When you’re happy with your data representation, use your existing continuous testing infrastructure and add a test that checks for this new information.

Finally, extend the book object produced in parse-rdf.js to include this data to make the test pass.

Challenge Description

Resources
