
Introduction

...

For each identified attribute, the statement returns the following details:

  • #docs: Specifies the number of documents in the sample that contain this attribute.

  • %docs: Specifies the percentage of documents in the sample that contain this attribute.

  • minitems: If the data type is an array, specifies the minimum number of elements (array size).

  • maxitems: If the data type is an array, specifies the maximum number of elements (array size).

  • samples: Displays a list of sample values for the attribute found in the sample population.

  • type: Specifies the identified data type of the attribute.

Thus, the type and samples details can be used to infer the type of each schema field.

However, there are some concerns:

  • Couchbase's number type can map to a CDAP int, long, double, or decimal. To infer the actual CDAP type, the best we can do is analyze the samples.

          Suppose all values of some property are less than java.lang.Integer.MAX_VALUE, except one that equals java.lang.Long.MAX_VALUE. If the samples do not contain java.lang.Long.MAX_VALUE, we will infer the wrong CDAP type (int instead of long).

  • There is no way to determine whether a field is nullable. The proposal is to make all fields nullable and let the user change this manually.

  • INFER does not honor the SELECT query and returns all document attributes. If we want to filter schema fields according to the specified query, we have to parse the query ourselves.

  • The INFER statement is supported since Couchbase Server 4.5, so we won't be able to support versions [4.0-4.5).
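
The sample-based inference described in the first concern could look roughly like the sketch below. It is only an illustration: the class NumberTypeInference and the enum NumericType are hypothetical stand-ins rather than CDAP or Couchbase APIs, decimal is left out for brevity, and values that are fractional or outside the long range fall back to double.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical helper: picks the narrowest numeric type that fits every sample value. */
final class NumberTypeInference {

    /** Stand-in for the CDAP types a Couchbase number may map to (decimal omitted). */
    enum NumericType { INT, LONG, DOUBLE }

    /** Returns INT for an empty sample list, since nothing contradicts the narrowest type. */
    static NumericType infer(List<? extends Number> samples) {
        NumericType result = NumericType.INT;
        for (Number sample : samples) {
            BigDecimal value = new BigDecimal(sample.toString());
            boolean fractional = value.stripTrailingZeros().scale() > 0;
            if (fractional || !fits(value, Long.MIN_VALUE, Long.MAX_VALUE)) {
                // A single fractional or oversized sample forces the widest type.
                return NumericType.DOUBLE;
            }
            if (!fits(value, Integer.MIN_VALUE, Integer.MAX_VALUE)) {
                result = NumericType.LONG;
            }
        }
        return result;
    }

    private static boolean fits(BigDecimal value, long min, long max) {
        return value.compareTo(BigDecimal.valueOf(min)) >= 0
            && value.compareTo(BigDecimal.valueOf(max)) <= 0;
    }
}
```

The result is only as good as the samples: if the one value equal to java.lang.Long.MAX_VALUE is missing from them, this sketch returns INT, which is exactly the failure mode described above.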


Source Splitter

The proposal is to add a "Number of Splits" source configuration property that specifies the desired number of splits to divide the query into when reading from Couchbase.

Fewer splits may be created if the query cannot be divided into the desired number of splits.

Also, we can use '0' as the default value for this configuration property and determine the number of splits according to the number of map tasks (controlled by the "mapreduce.job.maps" property):

public List<InputSplit> getSplits(JobContext job) throws IOException {
    ...
    // Default target: the configured number of map tasks ("mapreduce.job.maps"), or 1 if unset.
    int targetNumTasks = job.getConfiguration().getInt(MRJobConfig.NUM_MAPS, 1);
    ...
}

A 'SELECT COUNT(*)' query can be used to get the total number of documents, which are then divided between splits using 'OFFSET' and 'LIMIT'.
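
As a rough sketch of that division, the helper below turns a document count and a desired number of splits into OFFSET/LIMIT windows. The names SplitPlanner and Split are hypothetical and not part of the plugin. The sketch also assumes the query returns documents in a stable order across executions, which the design above does not address.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides a result set of known size into OFFSET/LIMIT windows. */
final class SplitPlanner {

    /** One window over the query result, to be appended as "LIMIT limit OFFSET offset". */
    static final class Split {
        final long offset;
        final long limit;

        Split(long offset, long limit) {
            this.offset = offset;
            this.limit = limit;
        }
    }

    static List<Split> plan(long totalDocuments, int desiredSplits) {
        if (totalDocuments < 0 || desiredSplits < 1) {
            throw new IllegalArgumentException("totalDocuments must be >= 0 and desiredSplits >= 1");
        }
        List<Split> splits = new ArrayList<>();
        if (totalDocuments == 0) {
            return splits; // nothing to read
        }
        // Fewer splits are created when there are fewer documents than requested splits.
        long numSplits = Math.min(desiredSplits, totalDocuments);
        long base = totalDocuments / numSplits;
        long remainder = totalDocuments % numSplits;
        long offset = 0;
        for (long i = 0; i < numSplits; i++) {
            // The first 'remainder' splits take one extra document each.
            long limit = base + (i < remainder ? 1 : 0);
            splits.add(new Split(offset, limit));
            offset += limit;
        }
        return splits;
    }
}
```

For example, 10 documents and 3 desired splits give windows of 4, 3, and 3 documents starting at offsets 0, 4, and 7, while 2 documents and 5 desired splits give only 2 splits.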


Source Properties

| Section | User Configuration Label | Label Description | Options | Default | Variable | User Widget |
|---|---|---|---|---|---|---|
| General | Label | Label for UI. | | | | textbox |
| | Reference Name | Uniquely identified name for lineage. | | | referenceName | textbox |
| | Nodes | List of nodes to use when connecting to the Couchbase cluster. | | | nodes | csv |
| | Bucket | Couchbase bucket name. | | | bucket | textbox |
| | Select Fields | Comma-separated list of fields to be read. | | * | selectFields | textbox |
| | Conditions | Optional criteria (filters or predicates) that the result documents must satisfy. Corresponds to the WHERE clause in a N1QL SELECT statement. | | | conditions | textbox |
| | Output Schema | Specifies the schema of the documents. | | | schema | schema |
| Credentials | Username | User identity for connecting to Couchbase. | | | username | textbox |
| | Password | Password to use to connect to Couchbase. | | | password | password |
| Error Handling | On Record Error | How to handle an error in record processing. | Skip error, Send to error, Fail pipeline | Fail pipeline | on-error | radio-group (layout: block) |
| Advanced | Max Parallelism | Maximum number of CPU cores that can be used to process a query. If the specified value is less than zero or greater than the total number of cores in a cluster, the system will use all available cores in the cluster. | | 0 | maxParallelism | number |
| | Scan Consistency | Specifies the consistency guarantee or constraint for index scanning. | Not Bounded, At Plus, Request Plus, Statement Plus | Not Bounded | scanConsistency | select |
| | Query Timeout | Number of seconds to wait before a query times out. | | 600 | timeout | number |
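
To illustrate how the Bucket, Select Fields, and Conditions properties could combine into a single N1QL statement, here is a minimal sketch. The class N1qlQueryBuilder is hypothetical, the exact statement shape is an assumption rather than the plugin's actual query, and no identifier escaping or validation is attempted.

```java
/** Hypothetical helper: assembles a N1QL SELECT statement from the source properties. */
final class N1qlQueryBuilder {

    static String build(String bucket, String selectFields, String conditions) {
        // Fall back to the documented default "*" when no fields are given.
        String fields = (selectFields == null || selectFields.trim().isEmpty())
            ? "*"
            : selectFields.trim();
        // Backticks allow bucket names containing characters such as '-'.
        StringBuilder query = new StringBuilder("SELECT ")
            .append(fields)
            .append(" FROM `")
            .append(bucket)
            .append('`');
        // Conditions are optional and map directly to the WHERE clause.
        if (conditions != null && !conditions.trim().isEmpty()) {
            query.append(" WHERE ").append(conditions.trim());
        }
        return query.toString();
    }
}
```

With bucket "travel-sample", fields "name, age", and conditions "age > 21", this produces: SELECT name, age FROM `travel-sample` WHERE age > 21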

...

The source requires Output Schema to be set. Based on the schema, the source expects each field in a document to be of a specific Couchbase data type.

The On Record Error property lets the user decide whether the pipeline should fail, the record should be skipped, or the record should be sent to the error dataset.
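
The three On Record Error behaviours can be sketched as follows. RecordErrorHandling, the ErrorHandling enum, and the in-memory error list are hypothetical simplifications for illustration. They do not reflect the CDAP emitter API, where errors would go to the error dataset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of the Skip error / Send to error / Fail pipeline options. */
final class RecordErrorHandling {

    enum ErrorHandling { SKIP, SEND_TO_ERROR, FAIL_PIPELINE }

    /**
     * Applies 'transform' to every input. Inputs that fail are dropped, collected in
     * 'errors', or cause the exception to propagate, depending on 'mode'.
     */
    static <I, O> List<O> transformAll(List<I> inputs, Function<I, O> transform,
                                       ErrorHandling mode, List<I> errors) {
        List<O> output = new ArrayList<>();
        for (I input : inputs) {
            try {
                output.add(transform.apply(input));
            } catch (RuntimeException e) {
                switch (mode) {
                    case SKIP:
                        break; // silently drop the record
                    case SEND_TO_ERROR:
                        errors.add(input); // stand-in for the error dataset
                        break;
                    default:
                        throw e; // FAIL_PIPELINE: stop processing
                }
            }
        }
        return output;
    }
}
```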

...

| Section | User Configuration Label | Label Description | Options | Default | Variable | User Widget |
|---|---|---|---|---|---|---|
| General | Label | Label for UI. | | | | textbox |
| | Reference Name | Uniquely identified name for lineage. | | | referenceName | textbox |
| | Nodes | List of nodes to use when connecting to the Couchbase cluster. | | | nodes | csv |
| | Bucket | Couchbase bucket name. | | | bucket | textbox |
| | Key Field | Allows the user to specify which of the incoming fields should be used as a document identifier. The identifier is expected to be of type string. | | | keyField | input-field-selector |
| | Operation | Type of write operation to perform. This can be set to Insert, Replace or Upsert. | Insert, Replace, Upsert | Insert | operation | radio-group |
| Credentials | Username | User identity for connecting to Couchbase. | | | username | textbox |
| | Password | Password to use to connect to Couchbase. | | | password | password |
| Advanced | Batch Size | Size (in number of records) of the batched writes to the Couchbase bucket. Each write to Couchbase carries some overhead. To maximize bulk write throughput, maximize the amount of data stored per write. Commits of 1 MiB usually provide the best performance. Default value is 100 records. | | 100 | batchSize | number |
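
The Batch Size behaviour can be pictured as a buffer that hands records to the writer in groups. The sketch below is illustrative only: BatchBuffer is a hypothetical class, and the Consumer stands in for whatever bulk write call the sink ends up using against Couchbase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer: collects records and writes them out in groups of 'batchSize'. */
final class BatchBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> writer;
    private final List<T> buffer = new ArrayList<>();

    BatchBuffer(int batchSize, Consumer<List<T>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    /** Buffers one record and writes a full batch as soon as 'batchSize' is reached. */
    void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes any remaining records. Call once after the last record. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        writer.accept(new ArrayList<>(buffer)); // pass a copy so clearing is safe
        buffer.clear();
    }
}
```

With a batch size of 2 and five records, the writer is called three times, with 2, 2, and 1 records, the last call coming from the final flush().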

...