John Snow Labs improves natural language processing solution
John Snow Labs, developer of the Spark NLP library, is releasing Spark NLP version 2.2, improving accuracy and enabling new use cases as prioritized by customers and the community.
Major new features include OCR-based coordinate highlighting, BERT embeddings refactoring and tuning, new tools for accuracy evaluation in Python, and more.
The release includes:
- Named Entity Recognition with deep learning now has an `includeConfidence` parameter that returns confidence scores in the prediction metadata.
- Named Entity Recognition with deep learning now has an `enableOutputLog` parameter that writes training metric logs to a file, making it easier to track and optimize long model training runs.
- OCRHelper now returns a coordinate positions matrix for text converted from PDF documents.
- A new annotator called PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types. This enables visualizing where each chunk originally came from in a PDF.
- The evaluation module has now also been ported to Python. It provides accuracy metrics for each epoch in a machine learning or deep learning training run for new NLP models.
- WordEmbeddings now includes coverage metadata. Two new static functions, `withCoverageColumn` and `overallCoverage`, report how much of the input text the embeddings cover.
- A new `poolingLayer` parameter in BERT allows selection of the pooling layer. This has been shown to improve accuracy for some domain-specific NLP use cases.
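
To make the embeddings coverage item above more concrete, the following is a minimal plain-Python sketch of what such a metric can compute. It assumes that "coverage" means the share of tokens found in the embeddings vocabulary. The names `Coverage`, `coverage`, and `overall_coverage` and the sample data are hypothetical illustrations, not the Spark NLP API or its implementation.

```python
from typing import Iterable, List, NamedTuple, Set


class Coverage(NamedTuple):
    """Hypothetical result type: tokens covered, tokens seen, and their ratio."""
    covered: int
    total: int
    percentage: float


def coverage(tokens: Iterable[str], vocabulary: Set[str]) -> Coverage:
    """Count how many tokens have an entry in the embeddings vocabulary."""
    token_list = list(tokens)
    covered = sum(1 for token in token_list if token in vocabulary)
    total = len(token_list)
    # Avoid dividing by zero for empty documents.
    percentage = covered / total if total else 0.0
    return Coverage(covered, total, percentage)


def overall_coverage(rows: List[List[str]], vocabulary: Set[str]) -> Coverage:
    """Aggregate coverage over every token in every row."""
    return coverage((token for row in rows for token in row), vocabulary)


# Assumed toy vocabulary and documents, for illustration only.
vocabulary = {"the", "patient", "has", "fever"}
rows = [
    ["the", "patient", "has", "fever"],
    ["the", "patient", "has", "pyrexia"],
]

per_row = [coverage(row, vocabulary) for row in rows]  # one result per document
overall = overall_coverage(rows, vocabulary)           # one result for the dataset
```

A per-row figure shows which documents contain many out-of-vocabulary tokens, while the overall figure indicates whether a given set of embeddings suits the dataset as a whole.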
For more information about this news, visit www.johnsnowlabs.com.