URIs with encoded characters being decoded prior to resolution

Hi all!

The URLs I am using for my LDP Containers include ISO timestamps; however, because the ‘:’ character in URIs is intended as a separator, I percent-encode these characters in the Container and Resource URLs. It appears, however, that Virtuoso is decoding the path portion of the URL before resolving it, resulting in 404s. See the interaction below (the Container was created using a PUT + Slug header with the encoded timestamp in the slug).

curl -v -i -H "Accept: text/turtle" \
  'http://evaluations.fairdata.solutions/DAV/home/LDP/evals/da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11%3A26%3A42.757Z/'
*   Trying 35.187.51.96...
* TCP_NODELAY set
* Connected to evaluations.fairdata.solutions (35.187.51.96) port 80 (#0)
* Server auth using Basic with user 'XXXXX'
> GET /DAV/home/LDP/evals/da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11%3A26%3A42.757Z/ HTTP/1.1
> Host: evaluations.fairdata.solutions
> Authorization: Basic ZGJhOmZhaXJldmFsdWF0b3I=
> User-Agent: curl/7.60.0
> Accept: text/turtle
>
< HTTP/1.1 404 Not Found
HTTP/1.1 404 Not Found
< Date: Mon, 05 Nov 2018 13:52:32 GMT
Date: Mon, 05 Nov 2018 13:52:32 GMT
< Server: Virtuoso/07.20.3217 (Linux) x86_64-pc-linux-gnu
Server: Virtuoso/07.20.3217 (Linux) x86_64-pc-linux-gnu
< X-Frame-Options: SAMEORIGIN
X-Frame-Options: SAMEORIGIN
< Content-Type: text/html; charset=UTF-8
Content-Type: text/html; charset=UTF-8
< Accept-Ranges: bytes
Accept-Ranges: bytes
< Content-Length: 266
Content-Length: 266

<
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
  <head>
    <title>404 Not Found</title>
  </head>
  <body>
    <h1>Not found</h1>
    Resource
/DAV/home/LDP/evals/da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11:26:42.757Z/
not found.
  </body>
* Connection #0 to host evaluations.fairdata.solutions left intact
</html>

This looks like a bug to me: Virtuoso is converting da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11%3A26%3A42.757Z to da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11:26:42.757Z at the HTTP level, before the path is resolved.

I’ll have it looked into further.

It’s worth noting that unescaped colons are legal in a path segment of a URI, as are all the other characters here:

a-zA-Z0-9!$&'()*+,;=:@~_.-

Thus, colons SHOULD NOT be encoded in this segment.

That said, %3A and : are supposed to be semantically equivalent in this segment, so these should be 100% interchangeable –

  • http://evaluations.fairdata.solutions/DAV/home/LDP/evals/da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11%3A26%3A42.757Z/

  • http://evaluations.fairdata.solutions/DAV/home/LDP/evals/da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11:26:42.757Z/
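A minimal Python sketch of what "interchangeable" means for these two URLs, assuming the path is split on "/" first and each segment is then percent-decoded with the standard library's urllib.parse. The helper name decoded_segments is only for illustration, and this is not a description of how Virtuoso resolves paths.

```python
from urllib.parse import unquote, urlsplit

ENCODED = ("http://evaluations.fairdata.solutions/DAV/home/LDP/evals/"
           "da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11%3A26%3A42.757Z/")
PLAIN = ("http://evaluations.fairdata.solutions/DAV/home/LDP/evals/"
         "da613f1b-bd1b-5d7c-8e58-190a308e226c/2018-05-04T11:26:42.757Z/")


def decoded_segments(url: str) -> list[str]:
    """Split the path on "/" first, then percent-decode each segment.

    Splitting before decoding keeps an encoded slash (%2F) from being
    mistaken for a segment boundary.
    """
    return [unquote(segment) for segment in urlsplit(url).path.split("/")]


# Both spellings name the same sequence of path segments.
same_resource = decoded_segments(ENCODED) == decoded_segments(PLAIN)
print(same_resource)
```

A server that compares decoded segments like this would route both requests to the same resource, whichever spelling was used when the Container was created.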

Yes, that’s correct. When I use a colon character, the problem does not arise (i.e., I am able to PUT and GET the record with that URL). However, the intended “semantics” of a colon character in the path do not match the way I use them (my reading of the spec is that a colon is intended to be a separator of some meaningful kind with respect to the URI), so I encoded them. This is when the problem begins. They should be equivalent URIs, but they are not treated as such.

my reading of the spec is that a colon is intended to be a separator of some meaningful kind v.v. the URI

This is a misunderstanding. In the path portion of a URI, the colon is not intended to be a separator; it is just another pchar (see the last line of this ABNF snippet from RFC 3986, Section 3.3):

      path          = path-abempty    ; begins with "/" or is empty
                    / path-absolute   ; begins with "/" but not "//"
                    / path-noscheme   ; begins with a non-colon segment
                    / path-rootless   ; begins with a segment
                    / path-empty      ; zero characters

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>

      segment       = *pchar
      segment-nz    = 1*pchar
      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
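
To make the pchar rule concrete, here is a rough Python sketch that checks a single path segment against it. The regular expression is a hand translation of the `segment = *pchar` production, using the RFC 3986 definitions of unreserved, sub-delims and pct-encoded; the function name is_valid_segment is hypothetical.

```python
import re

# Character sets from RFC 3986, written as regex character-class fragments.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:[{UNRESERVED}{SUB_DELIMS}:@]|{PCT_ENCODED})"

# segment = *pchar
SEGMENT = re.compile(rf"{PCHAR}*")


def is_valid_segment(segment: str) -> bool:
    """Return True if `segment` matches the `segment` production."""
    return SEGMENT.fullmatch(segment) is not None


# The timestamp is a valid segment with literal colons...
print(is_valid_segment("2018-05-04T11:26:42.757Z"))
# ...and also with the colons percent-encoded.
print(is_valid_segment("2018-05-04T11%3A26%3A42.757Z"))
```

Both spellings of the timestamp pass, which is the point: the grammar allows the colon as-is, so encoding it is permitted but not required.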

So, yes, there is a bug in Virtuoso (which we will be addressing), in that we are not treating these URIs as equivalent, but there is also a bug in your URI generation, in that these colons should not be encoded. If you do not encode them (as would be correct), then our bug would not bite you.
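
On the client side, one way to avoid encoding the colons is to tell the encoder which characters are safe in a path segment. The sketch below assumes Python's urllib.parse.quote; the names PCHAR_SAFE, encode_segment and CONTAINER_BASE are made up for illustration, and the base URL is simply the one from the curl example above.

```python
from urllib.parse import quote

# sub-delims plus ":" and "@": characters a path segment may carry unencoded.
# quote() already leaves letters, digits and "_.-~" alone.
PCHAR_SAFE = "!$&'()*+,;=:@"

# Assumed container URL, taken from the request shown earlier in the thread.
CONTAINER_BASE = ("http://evaluations.fairdata.solutions/DAV/home/LDP/evals/"
                  "da613f1b-bd1b-5d7c-8e58-190a308e226c/")


def encode_segment(segment: str) -> str:
    """Percent-encode only what is not allowed in a single path segment.

    "/" is deliberately left out of the safe set, so a slash inside the
    data becomes %2F instead of creating a new segment.
    """
    return quote(segment, safe=PCHAR_SAFE)


timestamp = "2018-05-04T11:26:42.757Z"
url = CONTAINER_BASE + encode_segment(timestamp) + "/"
print(url)
```

With this safe set the ISO timestamp passes through unchanged, while characters that really cannot appear in a segment (spaces, slashes, non-ASCII) are still encoded.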