Convert to HTML

The Pigro Intelligent Search API accepts documents in HTML or plain text for uploading. If your documents are in another format, you can use our Convert to HTML API to convert your files into clean, safe HTML. We support the most common document formats (Word, PowerPoint, Excel, and OpenOffice formats). PDF documents are not supported in the free version, but PDF support can be activated in the premium one.

Because conversion is time-intensive, the API is asynchronous. The response to the API call confirms that the document has been received correctly; when the conversion is completed, the result is sent to the webhook you indicated. The request looks like this:

curl --location \
--request POST 'https://api.pigro.ai/convert_to_html?webhook=https://your_webhook.url&webhook_images=https://your_images_webhook.url&img_management=url' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: Your32CharactersApiKeyHere_' \
--form 'file=@"/your_file.docx"'
  • file: This is the file to be imported as binary in the request form.
  • webhook: The URL where the conversion result will be sent. The request will be in JSON format. In case of success, it looks like this:
{
  "status": true,
  "converted_document": "your converted HTML document"
}
  • img_management: It can be internal or url. As documents can contain images, these need to be inserted in the HTML code. Pigro currently offers two possibilities:
  1. internal: the image files are encoded as base64, and inserted inline in the "src" attribute of the images.
  2. url: when an image is encountered, it is POSTed as a binary file to the client-provided webhook (parameter webhook_images).
  • webhook_images: The URL that receives a POST request for each image during the conversion and returns the URL to put in the src attribute of the image tag. Only needed if img_management is set to url. The response from the webhook_images URL is expected to have status code 200 and the following body:
{
    "url":"http://the.url/to/put/in/src"
}

The given URL is inserted as the src of the image. It can be either absolute or relative; the conversion inserts the exact text it receives.

If internal mode is used, webhook_images can be omitted from the request.
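
A minimal sketch of a webhook_images receiver, using only Python's standard library. It assumes the image arrives as the raw POST body; the exact encoding of the upload (raw body versus multipart form) is not specified above, so the parsing may need adjusting. The storage directory, the public path prefix, and all function names are hypothetical.

```python
import hashlib
import json
from http.server import BaseHTTPRequestHandler
from pathlib import Path

IMAGE_DIR = Path("converted_images")  # hypothetical storage location
PUBLIC_PREFIX = "/static/images"      # hypothetical path served by your site


def store_image(data: bytes, image_dir: Path = IMAGE_DIR) -> str:
    """Save the image bytes and return the URL to place in the src attribute."""
    image_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(data).hexdigest()[:16]  # content-based file name
    (image_dir / name).write_bytes(data)
    return f"{PUBLIC_PREFIX}/{name}"


def build_response_body(url: str) -> bytes:
    """Build the JSON body the conversion expects: {"url": "..."}."""
    return json.dumps({"url": url}).encode("utf-8")


class ImageWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the image is the raw request body.
        length = int(self.headers.get("Content-Length", 0))
        body = build_response_body(store_image(self.rfile.read(length)))
        self.send_response(200)  # the conversion expects status code 200
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it locally:
#   from http.server import HTTPServer
#   HTTPServer(("0.0.0.0", 8000), ImageWebhook).serve_forever()
```

Because the returned text is inserted verbatim, a relative path such as the one produced here only works if the converted HTML is later served from a site that exposes that path.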

The API responds to let the client know whether the conversion has been correctly prepared. If so, it returns JSON like the following:

{
	"message": "File has been converted and the webhook should be called with its data..."
}
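
The request URL and the webhook payload described above can be handled with a few helper functions. This is a sketch under stated assumptions: the function names are hypothetical, the query parameters are percent-encoded (standard practice, though the curl example passes them unencoded), and only the success payload is read because the failure payload is not described here.

```python
import json
from urllib.parse import urlencode

CONVERT_ENDPOINT = "https://api.pigro.ai/convert_to_html"


def build_convert_url(webhook, img_management, webhook_images=None):
    """Return the convert_to_html URL with percent-encoded query parameters."""
    if img_management not in ("internal", "url"):
        raise ValueError("img_management must be 'internal' or 'url'")
    params = {"webhook": webhook, "img_management": img_management}
    if img_management == "url":
        if not webhook_images:
            raise ValueError("webhook_images is required when img_management is 'url'")
        params["webhook_images"] = webhook_images
    return f"{CONVERT_ENDPOINT}?{urlencode(params)}"


def read_conversion_result(raw_body):
    """Parse the JSON sent to the webhook; return the HTML, or None on failure."""
    payload = json.loads(raw_body)
    if payload.get("status") is True:
        return payload["converted_document"]
    return None  # the failure payload is not documented above, so nothing else is read
```

For example, build_convert_url("https://your_webhook.url", "internal") omits webhook_images entirely, which matches the rule that it is only needed in url mode.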

Upload

After converting your documents you can now upload them in order to build your library. Remember to set your language before uploading your documents (see more about this in Getting started).

Add multiple documents

📘

Important!

Remember to call the Training API after adding multiple documents so that they appear in search results!

To add multiple documents you can use the Add Documents API. With this API you can upload all your documents in a single API call. You should prepare your data with the following JSON structure:

{
  "documents": [
    {
      "id": 1,
      "title": "Document title",
      "body": "The full content of the document. It can be plain text or HTML-enriched content"
    },
    {
      "id": "aaCjAH",
      "title": "You can upload more than one document",
      "body": "You just need to separate them in a list"
    },
    {
      "id": "identifier_111",
      "title": "You can use any unique identifier",
      "body": "Be sure that your identifier is indeed unique. Duplicated documents won't be added to your library"
    }
  ]
}
  • The id must be unique; it can be an integer or a string. Documents whose id is already present in your library won't be uploaded.
  • The body of the document can be plain text or HTML content. If it is HTML, make sure the content is passed as a string and not as a bytes-like object.

Then you can use the Add Documents API to actually upload them. After sending the data you will need to train your library for the documents to become searchable; this step is explained in more detail in the Train section.
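
The two rules above (unique ids, string bodies) can be checked before sending the request. The following is a minimal sketch; the function name is hypothetical, and decoding bytes as UTF-8 is an assumption about your own data, not something the API prescribes.

```python
import json


def build_documents_payload(docs):
    """Validate id/title/body records and wrap them as {"documents": [...]}."""
    seen = set()
    documents = []
    for doc in docs:
        doc_id, body = doc["id"], doc["body"]
        if isinstance(doc_id, bool) or not isinstance(doc_id, (int, str)):
            raise TypeError(f"id must be an integer or a string: {doc_id!r}")
        if isinstance(body, (bytes, bytearray)):
            body = body.decode("utf-8")  # assumption: bytes bodies are UTF-8 text
        if doc_id in seen:
            raise ValueError(f"duplicate id in batch: {doc_id!r}")
        seen.add(doc_id)
        documents.append({"id": doc_id, "title": doc["title"], "body": body})
    return {"documents": documents}


payload = build_documents_payload([
    {"id": 1, "title": "Document title", "body": b"<p>HTML read from a file</p>"},
    {"id": "aaCjAH", "title": "Second document", "body": "Plain text"},
])
request_body = json.dumps(payload)  # the JSON string to send to the Add Documents API
```

This only catches duplicates within one batch. Ids already present in your library are still rejected by the API itself.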

Add a single document

If you need to upload a single document you can use the Add Single Document API. The structure of the data is very similar to the previous one:

{
  "document": {
      "id": 1,
      "title": "Document title",
      "body": "The full content of the document. It can be plain text or HTML-enriched content"
    }
}
  • The id must be unique; it can be an integer or a string. A document whose id is already present in your library won't be uploaded.
  • The body of the document can be plain text or HTML content. If it is HTML, make sure the content is passed as a string and not as a bytes-like object.

In this case there is no need to call the Train API after uploading.

Update your documents

If you want to update the documents you previously uploaded you can do that by specifying the new titles and/or bodies. Remember to use the same identifier you used during the upload phase.

Update multiple documents

📘

Important!

Remember to call the Training API after updating multiple documents so that the modifications appear in search results!

To update more than one document you can use the Update Documents API. Your data should be formatted in this way:

{
     "documents": [
          {
               "title": "First document title",
               "body": "This is the content of the first document",
               "id": 1
          },
          {
               "title": "Second document title",
               "body": "This is the content of the second document",
               "id": "aCXHHjko"
          }
     ]
}

The behaviour is the same as in the Add Documents API. After updating your documents, you need to call the Train API for the changes to take effect.

Update single document

If you need to update only a single document you can use the Update Single Document API with your data structured like this:

{
  "document": {
    "title": "First and only document title",
    "body": "This is the content of the first and only document",
    "id": 1
  }
}

As with adding a single document, there is no need to call the Train API after updating a single document to sync your library.

Delete

To delete documents from your library, use the Delete API. It simply takes a list of the ids of the documents you want to remove from your library, like this:

{
  "ids": [12, "aCXHHjko", 42]
}

In case some of the ids are not present in your library you will still receive a 200 response with a warning, like this:

{
  "message": "Deletion went well. Some ids were not found: [..."
}

If none of the ids are present you will receive a 404 response:

{
  "message": "Ids not found"
}
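
The two outcomes above can be told apart by status code alone. A minimal sketch, with a hypothetical function name; it does not try to parse the missing ids out of the warning message, since the exact message format is not shown in full here, and it treats any other status code as unexpected.

```python
def interpret_delete_response(status_code, body):
    """Map a Delete API response to (deleted_any, message)."""
    message = body.get("message", "")
    if status_code == 200:
        # Success, possibly with a warning about ids that were not found.
        return True, message
    if status_code == 404:
        # None of the requested ids exist in the library.
        return False, message
    raise RuntimeError(f"unexpected status {status_code}: {message}")
```

Returning the message unchanged lets the caller log the warning about missing ids without depending on its wording.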

After deleting documents you won't have to train the library to sync the modifications.
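
The training rules stated in the Upload, Update, and Delete sections can be summarized in a small lookup table. This is only a sketch: the operation labels are hypothetical names chosen for illustration and are not API endpoint names.

```python
# True means the Train API must be called afterwards for changes to be searchable.
NEEDS_TRAINING = {
    "add_documents": True,           # adding multiple documents
    "add_single_document": False,
    "update_documents": True,        # updating multiple documents
    "update_single_document": False,
    "delete": False,
}


def training_required(operations):
    """Return True if any operation in the sequence requires a Train call."""
    return any(NEEDS_TRAINING[op] for op in operations)
```

For example, training_required(["add_single_document", "delete"]) is False, while a sequence that includes "update_documents" requires one Train call at the end.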

Upload alias

Our solution doesn't need a specific configuration based on the type of content you upload. In most cases this functionality is not necessary and you can skip this step. There are some cases, however, where specific terms, particular acronyms, or product names may require synonyms, or "alias", to get better results in your searches.

An example: a bank has a particular promotion on a certain type of bank account called "Under30PromotionAccount". In this case the bank may want the promoted account to come up in the results every time a user looks for a generic bank account.

You can upload your own alias dictionary by specifying the keywords for which alias should also be considered in your searches.
To enable search with alias, remember to update the settings using the Update Settings API.

You can build your own alias dictionary in the following way:

{
     "alias_dictionary": {
          ...
          "Under30PromotionAccount": [
               "bank account",
               "account",
               "young"
          ],
          "young promotion": [
               "bank account",
               "account",
               "young"
          ],
          ...
     }
}
  • The keys of the alias_dictionary are the keywords for which you want to create alias. These are the words that you expect to be in your documents; in our previous example, Under30PromotionAccount would be one of these.
  • The values of the dictionary are terms that you expect to be in your queries and that you want to treat as if they were the keywords. In our previous example, any of the terms bank account, account, and young will count as Under30PromotionAccount. Alias terms are, however, penalized slightly with respect to query terms.

Both the keywords and the alias can be single words (unigrams) or two-word phrases (bigrams).

Example query: What are the promotions for young people?
Documents containing "Under30PromotionAccount" will likely appear in the results.

There is no need to differentiate between singular and plural words.
Alias are global for the library: if you enable them, every query will consider alias.
Every time you want to update your alias dictionary you should re-upload the whole dictionary, containing all the keywords and alias.

If you disable the use of alias in the settings, your alias dictionary is kept. If you later reactivate it, the terms you already specified will be considered in your searches again.
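
To make the key/value semantics of the dictionary concrete, the sketch below shows which keywords a query would be associated with. It is an illustration only and not Pigro's matching logic: it ignores the penalization of alias terms and the singular/plural handling, and all function names are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def validate_alias_dictionary(alias_dictionary):
    """Check that every keyword and alias is one or two words long."""
    for keyword, aliases in alias_dictionary.items():
        for term in [keyword, *aliases]:
            if not 1 <= len(term.split()) <= 2:
                raise ValueError(f"not a unigram or bigram: {term!r}")


def keywords_for_query(query, alias_dictionary):
    """Return the keywords whose alias occur in the query."""
    alias_to_keywords = {}
    for keyword, aliases in alias_dictionary.items():
        for alias in aliases:
            alias_to_keywords.setdefault(" ".join(normalize(alias)), set()).add(keyword)
    tokens = normalize(query)
    # Candidate terms: every single word and every pair of adjacent words.
    candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    matched = set()
    for term in candidates:
        matched |= alias_to_keywords.get(term, set())
    return matched


alias_dictionary = {
    "Under30PromotionAccount": ["bank account", "account", "young"],
    "young promotion": ["bank account", "account", "young"],
}
validate_alias_dictionary(alias_dictionary)
matched = keywords_for_query("What are the promotions for young people?", alias_dictionary)
# matched contains both "Under30PromotionAccount" and "young promotion",
# because the query word "young" is an alias of each.
```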

Info

If you want to check the status of your Library you can use the Info API.
The only parameter is the verbose flag:

curl --request GET \
     --url 'https://api.pigro.ai/info?verbose=true' \
     --header 'Accept: application/json' \
     --header 'x-api-key: your32CharactersAPIkeyTokenHere_'

If verbose is set to false (the default), the response contains only the number of documents uploaded and trained, and whether the library can be searched:

{
  "n_docs_uploaded": 100,
  "n_docs_trained": 90,
  "can_search?": true
}

If verbose is set to true, the response also contains the ids of the documents uploaded and trained:

{
	"n_docs_uploaded": 10,
	"docs_uploaded_ids": [1, 2, "CacXhsLL", 4, "id_05", 6, "07", 8, "8-bis", 10],
	"n_docs_trained": 9,
	"docs_trained_ids": [1, 2, "CacXhsLL", 4, "id_05", 6, "07", 8, "8-bis"],
	"can_search?": true
}
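
One practical use of the verbose response is finding which documents have been uploaded but not yet trained. A minimal sketch, assuming the response has already been parsed into a Python dictionary; the function name is hypothetical.

```python
def untrained_ids(info):
    """Return uploaded ids that are missing from the trained ids, in upload order."""
    trained = set(info["docs_trained_ids"])
    return [doc_id for doc_id in info["docs_uploaded_ids"] if doc_id not in trained]


info = {
    "n_docs_uploaded": 10,
    "docs_uploaded_ids": [1, 2, "CacXhsLL", 4, "id_05", 6, "07", 8, "8-bis", 10],
    "n_docs_trained": 9,
    "docs_trained_ids": [1, 2, "CacXhsLL", 4, "id_05", 6, "07", 8, "8-bis"],
    "can_search?": True,
}
pending = untrained_ids(info)  # [10]
```

If the resulting list is not empty, a call to the Train API is probably still pending.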