Saturday, 25 March 2017

How to Process PDF Files in PDI (Pentaho Kettle)

Process PDF files in Pentaho Kettle.
--------------------------------------------
Prerequisite:
A PDF reader jar file is required. Please download it from the location below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing

Steps to read PDF files through Pentaho PDI (Kettle):

1. Use the Get File Names step to pick up the names of the PDF files.
2. Use a wildcard (.*.pdf) if you want to process multiple files.
3. Use the Copy rows to result step to put the PDF file names and their locations into the result stream, so they can be passed as parameters to the next transformation.
4. Create variables and assign the PDF file name and location to them (refer to the image below).

5. Use the Shell script step to execute the command below:

java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt

You can use hard-coded paths as well, for example:

java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt

6. In PDI, use the Shell script step to execute the above command.

7. Include all the above steps in one job.





Download a working copy of the above example from the link below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing





Monday, 15 August 2016

Export data to PDF through PDI

Export any data that is in PDI to PDF.

Please follow the steps below:
Download the related files from the location below:
https://drive.google.com/file/d/0B_j1hJxesvxdM3BMZXBLNWxoczg/view?usp=sharing
1. Extract the zip file.
2. Open "Abhishek_pdf_as_output_test.ktr" in PDI.
3. Run this transformation.
4. See the output as a PDF in the "output" folder, with the name "abhi.pdf".
5. If you want to modify, add, or remove data, open "source.ktr", go to the "Data Grid" step, modify the data under the Data tab, and save it. After that, run "Abhishek_pdf_as_output_test.ktr" in PDI again.

6. See the output as a PDF in the "output" folder, with the name "abhi.pdf".

Note: the Pentaho report (.prpt) file needs to be created based on the output you want in the PDF.


Tuesday, 12 May 2015

Change Data Capture (CDC) in Pentaho Kettle

CDC
-----
In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data. Change data capture is also an approach to data integration that is based on the identification, capture and delivery of the changes made to enterprise data sources.
CDC solutions occur most often in data-warehouse environments, since capturing and preserving the state of data across time is one of the core functions of a data warehouse, but CDC can be utilized in any database or data repository system.
--- Wikipedia


If you are not familiar with CDC, here is a brief summary: CDC allows you to capture all the data from the source system that has changed since the last time you ran the ETL process. Wikipedia gives a more detailed overview in case you are interested. In the easiest case you have some sort of timestamp or datetime column that you can use for this purpose. The following example is based on a simple sales table that has a date column which we can use for CDC:






Note that this dataset actually has quite old data, something that we have to keep in mind. What happens if new data gets inserted later on? More on this after the basics.



Open your favourite SQL client (and start your MySQL server if it is not running yet) and issue the following SQL statements:



USE test;

DROP TABLE IF EXISTS `sales`;

CREATE TABLE `sales` (
  `date` DATETIME,
  `product_type` VARCHAR(45),
  `sales` INT(255)
);

INSERT INTO `sales` VALUES
  ('2010-01-20 00:00:00','Shoes',234),
  ('2010-01-20 00:00:00','Cheese',456),
  ('2010-01-21 00:00:00','Shoes',256),
  ('2010-01-21 00:00:00','Cheese',156),
  ('2010-01-22 00:00:00','Shoes',535),
  ('2010-01-23 00:00:00','Cheese',433);

SELECT * FROM `sales`;



Now the way to get the timeframes for querying the data is to use the Get System Info step:

  1. Open Kettle and create a new transformation.
  2. Drag and drop a Get System Info step onto the canvas. You can find it in the Input folder.
  3. Double click on it and populate the names column in the grid with start_date and end_date.
  4. For the type, choose start date range (Transformation) and end date range (Transformation), respectively.


Click on Preview rows and you will see that Kettle displays a rather old datetime for the start date (1899/12/31 23:00:00) and the current datetime for the end date. This makes perfect sense, as on the first run of your transformation you want to get all the raw data from your source system. In case you don't want this, I'll show you a trick later on for how to change this.
This is an easy approach to CDC. You can feed the start and end date from the Get System Info step to a Table Input step, for example, and use them in the WHERE clause of your SQL query:


SELECT
  date
, product_type
, sales
FROM sales
WHERE
  date >= ? AND
  date < ?
;

The question marks will be replaced on execution by the start and end date (but make sure they are defined in this order in the Get System Info step).
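For illustration, at runtime the query above effectively becomes something like the following (the two datetime values are made-up examples; on a first run the start would be the 1899 default and the end would be the current datetime):

SELECT
  date
, product_type
, sales
FROM sales
WHERE
  date >= '1899-12-31 23:00:00' AND
  date <  '2015-05-12 10:00:00'
;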

Let's add some more steps to our transformation:

Make sure that you enable Replace variables in script? and choose the Get System Info step for Insert data from step. The hop between the Get System Info step and the Table Input step now also displays an info icon. 

Setting up Logging
Kettle provides various levels of logging, the transformation logging being the one with the highest level. We will only look at the transformation logging here. 


  1. Press CTRL+T and the Transformation properties dialog will be displayed. 
  2. Click on the Logging tab and then highlight Transformation on the left hand side. 
  3. Our log will be placed in a database table, hence provide the info for the data connection and table name.
  4. We only want to keep the log records for the last 30 days, hence we set Log record timeout (in days) to 30.
  5. It’s best to keep all the logging fields (less maintenance).
  6. Provide step names for some of the fields, for example:

  • LINES_INPUT: specify the input step that best represents the amount of imported data.
  • LINES_OUTPUT: specify the output step that best represents the amount of data written.

  7. Press the SQL button and Kettle will automatically generate the DDL (the create table statement) for you. Just press Execute and your logging table will be created in the specified database. Nice and easy!


So now we are all set and can run our transformation and see what’s happening. Click the Play button to execute the transformation. If your transformation executed successfully, close the transformation and open it again, then at the bottom click on the Execution History tab and you will see the logging information (Kettle automatically reads the data from the table we just created).
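You can also inspect the logging table directly with SQL. The sketch below assumes the logging table was named transformation_log (use whatever connection and table name you configured in the Logging tab); STARTDATE and ENDDATE are the standard Kettle log fields that hold the CDC date range, while LOGDATE is the time the entry was written:

SELECT ID_BATCH, TRANSNAME, STATUS, LINES_INPUT, LINES_OUTPUT, STARTDATE, ENDDATE, LOGDATE
FROM transformation_log
ORDER BY ID_BATCH DESC;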



Now pay attention to the Date Range Start and Date Range End fields: they contain the dates from our Get System Info step. Now execute the transformation again and see what happens (click the Refresh button):



  • You can clearly see that Kettle now chooses the CDC end datetime of the last transformation execution (which was that execution's start datetime) as the start date. The last transformation execution has Batch ID 0 in the screenshot above.
  • The CDC end date is set to the start date of the current transformation execution.

A safe approach to CDC

The CDC approach mentioned above is actually quite dangerous if your source data table doesn't get filled consistently. A safer approach is to use the maximum date of your source data as the CDC end date. Kettle provides this functionality out of the box: in the Transformation properties you can find a Dates tab. Click on it and provide the following details of your fact table (a conceptual sketch follows the list):


  1. Maxdate Connection: the database connection.
  2. Maxdate table: the table name of your source table.
  3. Maxdate field: the datetime field in your source table that you use for CDC.
  4. Maxdate offset (seconds): if there are some inconsistencies in your source data, you can set a negative offset here. For example, -120 means that 120 seconds will be deducted from the max end date for CDC.
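Conceptually, the CDC end date Kettle derives from these settings boils down to a query like the one below against our example table (a sketch only; the 120-second subtraction corresponds to a Maxdate offset of -120 as in point 4):

SELECT DATE_SUB(MAX(`date`), INTERVAL 120 SECOND) AS cdc_end_date
FROM sales;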


As you can see from the screenshot above, I referenced our main input table again. So whatever the maximum date in this table is, Kettle will now use it for CDC. Open a MySQL client and delete all the records from our transformation log table, then start the transformation. As you can see from the Execution History, Kettle now uses the maximum date of our data set.



What more to say? Well, if you are unhappy with the first timeframe that Kettle chooses, just run the transformation once without any raw data input step, open a SQL client and update the end date with a date of your liking (this end date should be the start date that you want for the next execution). The next time you run your transformation, this datetime will be used for the start of your timeframe. 
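A minimal sketch of that manual tweak, assuming the logging table is called transformation_log (substitute whatever name you configured); setting ENDDATE on the most recent batch defines the start of the date range for the next execution:

UPDATE transformation_log
SET ENDDATE = '2010-01-22 00:00:00'
ORDER BY ID_BATCH DESC
LIMIT 1;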

References

http://diethardsteiner.blogspot.in/

Thursday, 26 March 2015

How to Create a CDE Dashboard in Pentaho

1. Log into the Pentaho User Console, and launch the CDE
You can launch the editor from the PUC home page, from the PUC menu (File > New > CDE Dashboard), or by clicking on the CDE icon in the PUC toolbar.




2. Save the dashboard
Give the Dashboard a name and save it in the solution folder of your choice, and then close the Dashboard.

3. Edit the dashboard
Refresh the console by pressing F5 (or the keyboard shortcut Ctrl + R), find the Dashboard, right-click it, and select Edit. The Dashboard editor will appear.

4. Design the Layout
Select the Layout perspective and define the dashboard layout according to your sketch.




5. Add the datasources
Select the Datasource perspective and define the datasources that will feed your Dashboard.



Create a parameter using the Generic component.
Use the query below to populate the continents from the World database (MySQL):

select Distinct Continent from world.country






Then add a second SQL query datasource that uses the continent parameter (named Conti here) to return five countries of the selected continent with their populations:

select name, Population from world.country
where Continent = ${Conti}
limit 5





6. Add and configure the components
Select the Component perspective and add the components that make up your Dashboard: the visual elements, the parameters, and eventually some scripts.




7. Preview your work
Click on Preview regularly to see how your work is progressing. Repeat steps 4-6 until you are satisfied with your final dashboard. Don't forget to save your work from time to time.



Download the sample dashboard file for the above example from the link below:

https://drive.google.com/file/d/0B_j1hJxesvxddEkzbWowUnFYVm8/view?usp=sharing

Please use MySQL with the World database.
You can download the World sample database for free from the MySQL web site.
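As a quick sanity check (just a sketch) that the World sample database is loaded and reachable from the connection you configured, run:

SELECT COUNT(*) FROM world.country;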






Wednesday, 18 June 2014

Pentaho BA Server Performance Tips

The Pentaho BA Server ships in a condition designed to work well for the majority of customers. However, deployments that drift toward opposite extremes -- very large and very small -- will need to adjust certain settings, and possibly even remove certain unused functionality, in order to achieve the desired performance goals without adding hardware. Read through the subsections below and decide which ones apply to your scenario.
 
Move Pentaho Managed Data Sources to JNDI
 
Most production BI environments have finely-tuned data sources for reporting and analysis. If you haven't done any data warehouse performance-tuning, you may want to consult the Pentaho Analysis (Mondrian) Performance Tips documentation for basic advice before proceeding.
Pentaho provides a Data Source Wizard in the Pentaho User Console and a data source dialogue in the Pentaho Enterprise Console that enable business users to develop rapid prototype data sources for ad hoc reporting and analysis. This is a great way to get off the ground quickly, but they are "quick and dirty" and not performant. For maximum performance, you should establish your own JNDI data connections at the Web application server level, and tune them for your database.
JNDI data sources can be configured for Pentaho client tools by adding connection details to the ~/.pentaho/simple-jndi/default.properties file on Linux, or the %userprofile%\.pentaho\simple-jndi\default.properties file on Windows. Design Studio requires that connection details be added to /pentaho-solutions/system/simple-jndi/jdbc.properties as well.
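For reference, entries in the simple-jndi properties files follow the pattern sketched below; the connection name SampleJNDI, the MySQL driver and the credentials are placeholder assumptions that you would replace with your own values:

SampleJNDI/type=javax.sql.DataSource
SampleJNDI/driver=com.mysql.jdbc.Driver
SampleJNDI/url=jdbc:mysql://localhost:3306/world
SampleJNDI/user=dbuser
SampleJNDI/password=dbpassword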
Manual Cleanup of the /tmp Directory
Every time you generate content on the BA Server, temporary files are created on the local file system in the /pentaho-solutions/system/tmp/ directory. In some cases, the BA Server may not properly purge that temporary content, leaving behind orphaned artifacts that can slowly build up and reduce performance on the volume that contains the pentaho-solutions directory. One way to address this is to mount the /tmp directory on a separate volume, thereby siphoning off all disk thrash associated with creating new content. However, you will still have to perform a manual garbage collection procedure on this directory on a regular basis. You can accomplish this via a script that runs through your system scheduler; it should be safe to delete any content files in this directory that are more than a week old.
 
Memory Optimization for the Geo Service Plugin
 
The Pentaho Geo Service enables Geo Map visualizations in Analyzer.
If you do not use Analyzer, or are sure that you are not using the Geo Service, you can free up approximately 600MB of RAM by removing the Geo Service plugin. Simply shut down the BA Server and delete the /pentaho/server/biserver-ee/pentaho-solutions/system/pentaho-geo/ directory.
If you are a heavy user of the Geo Service, update the cache setting for pentaho-geo-municipality in ehcache.xml. This file can be found in the /pentaho/server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/classes directory.
<cache
  name="pentaho-geo-municipality"
  maxElementsInMemory="125000"
  eternal="false"
  overflowToDisk="true"
  timeToIdleSeconds="0"
  timeToLiveSeconds="0"
  diskPersistent="false"
  diskExpiryThreadIntervalSeconds="120"
/>
The default setting, maxElementsInMemory="125000", is a relatively low number and might need to be increased if you are using the Geo Service. Pentaho suggests a setting between 275000 and 700000 for heavy usage. When you increase the maxElementsInMemory setting, less memory is available for other resources. To accommodate higher maxElementsInMemory settings, increase the maximum memory allocated to the JVM running the Pentaho processes (the default JVM maximum heap setting is 768m).
 
Please refer to the PDF below for more detail:
 
https://drive.google.com/file/d/0B_j1hJxesvxdbDdEZWhkbTZmQW8/edit?usp=sharing

Abhishek