Streaming HTTP handlers

Services can be used for ingest and egress of data. In current CDAP (3.2.0), however, there are limitations to what you can do:

  • Every method call of a service handler is executed in a transaction. The typical transaction timeout is configured at around 30 seconds. This means that if a handler method needs longer than that to complete, the transaction will fail.
  • The content of the HTTP request is always buffered up in memory, hence the handler cannot receive large data. It would be better to stream the content. 
  • In case of transaction conflicts, the handler has no control over handling that error. 
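
To make the buffering limitation concrete, here is a minimal Java sketch of what chunk-wise consumption could look like. The ContentConsumer interface and the FileContentConsumer class are hypothetical names invented for this illustration; they are not part of the CDAP 3.2.0 API. The sketch assumes the HTTP layer would call onChunk for every piece of the request body, onFinish after the last piece, and onError if the request fails, and that the expected checksum arrives out of band (for example in a request header).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical callback interface (an assumption, not a CDAP 3.2.0 API).
interface ContentConsumer {
  void onChunk(ByteBuffer chunk) throws IOException; // called once per chunk of the body
  void onFinish() throws IOException;                // called after the last chunk
  void onError(Throwable cause);                     // called if the request fails
}

// Streams the request body to a file without holding it in memory,
// validates a CRC32 checksum at the end, and deletes the file on any failure.
final class FileContentConsumer implements ContentConsumer {
  private final Path target;
  private final long expectedCrc;
  private final CRC32 crc = new CRC32();
  private final WritableByteChannel channel;
  private long bytesWritten;

  FileContentConsumer(Path target, long expectedCrc) throws IOException {
    this.target = target;
    this.expectedCrc = expectedCrc;
    this.channel = Files.newByteChannel(target,
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING);
  }

  @Override
  public void onChunk(ByteBuffer chunk) throws IOException {
    // Checksum a duplicate so the chunk's position is left intact for the write.
    crc.update(chunk.duplicate());
    while (chunk.hasRemaining()) {
      bytesWritten += channel.write(chunk);
    }
  }

  @Override
  public void onFinish() throws IOException {
    channel.close();
    if (crc.getValue() != expectedCrc) {
      Files.deleteIfExists(target);
      throw new IOException("Checksum mismatch for " + target);
    }
  }

  @Override
  public void onError(Throwable cause) {
    try {
      channel.close();
      Files.deleteIfExists(target);
    } catch (IOException e) {
      cause.addSuppressed(e);
    }
  }

  long getBytesWritten() {
    return bytesWritten;
  }
}
```

With a design along these lines, memory use is bounded by the chunk size rather than the size of the upload, and the handler itself decides what happens to the file when validation fails.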

Here are some use cases where these limitations get in the way:

  1. A service handler to upload partitions to a partitioned file set:
    • With each request, a large file is received. 
    • Metadata about the file is received in the HTTP headers.
    • Based on the metadata, the handler determines the partition key for the file.
    • The content of the request is consumed and streamed to a file.
    • The handler validates the file (possibly using a checksum, or by checking its size or number of records).
    • The handler may also parse the content as it is streamed and validate it using lookups in a dataset. 
    • The handler registers the file as a new partition.
    • If an error occurs in any of these steps, the file must be deleted or moved to a quarantine area; a record of the error may also need to be saved to a dataset.
    • If there is a transaction conflict, the same applies.
    • Also, in case of an error, the handler has control over the HTTP response.
  2. A service handler to download large files:
    • Similar to use case 1, but simpler because no writes happen (and therefore no conflicts).
    • Also, the request is small but the response may be very large and take a long time to send.
  3. A handler to receive a sequence of records and process them one by one:
    • Processing a record may mean storing it in a dataset or looking it up in a dataset.
    • The response may indicate how many records were successfully processed (some may have conflicts)
    • The response may contain a new record for every record received.
    • The processing should continue in case of an error (even a transaction conflict). 
    • Possibly each record must be processed in its own transaction 
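
The third use case can be sketched in plain Java as follows. All types here (TxConflictException, TxRunner, BatchResult, RecordProcessor) are hypothetical stand-ins invented for this illustration, not CDAP 3.2.0 APIs. The sketch assumes the platform would give the handler some way to run a piece of work in its own short transaction and would report a conflict as an exception. Retrying a conflicted record up to maxAttempts times is only one possible policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical: signals that a transaction failed to commit due to a conflict.
class TxConflictException extends Exception {
  TxConflictException(String message) {
    super(message);
  }
}

// Hypothetical: runs the given body in a fresh transaction of its own.
@FunctionalInterface
interface TxRunner {
  void execute(Runnable body) throws TxConflictException;
}

// Summary that the handler could turn into the HTTP response.
final class BatchResult {
  final int processed;
  final List<String> conflicted;

  BatchResult(int processed, List<String> conflicted) {
    this.processed = processed;
    this.conflicted = conflicted;
  }
}

final class RecordProcessor {
  private RecordProcessor() { }

  // Processes each record in its own transaction. A conflict affects only
  // the current record: it is retried up to maxAttempts times and then
  // reported as conflicted, while the remaining records are still processed.
  static BatchResult processAll(Iterable<String> records, TxRunner tx,
                                Consumer<String> store, int maxAttempts) {
    int processed = 0;
    List<String> conflicted = new ArrayList<>();
    for (String record : records) {
      boolean done = false;
      for (int attempt = 1; attempt <= maxAttempts && !done; attempt++) {
        try {
          tx.execute(() -> store.accept(record));
          done = true;
        } catch (TxConflictException e) {
          // Leave done == false so the loop retries or gives up on this record.
        }
      }
      if (done) {
        processed++;
      } else {
        conflicted.add(record);
      }
    }
    return new BatchResult(processed, conflicted);
  }
}
```

Because every record gets a short transaction of its own, no single transaction has to span the whole request, and the handler can report in the response how many records succeeded and which ones hit conflicts.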
